HVT: An Introduction

Zubin Dowlaty, Shubhra Prakash, Sangeet Moy Das, Praditi Shah, Shantanu Vaidya, Somya Shambhawi, Vishwavani

2024-04-26

1. Abstract

The HVT package is a collection of R functions for building topology preserving maps for rich multivariate data analysis, with an emphasis on big data, that is, datasets with a large number of rows. The functions are organized around the following typical workflow:

  1. Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective.

  2. Data Projection: Dimension projection of the compressed cells to 1D, 2D, or an interactive surface plot with Sammon’s non-linear algorithm. This step creates topology preserving map (also called an embedding) coordinates in the desired output dimension.

  3. Tessellation: Create cells required for object visualization using the Voronoi tessellation method. The package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, which is useful for semi-supervised tasks.

  4. Scoring: Scoring new data sets and recording their assignment using the map objects from the above steps, in a sequence of maps if required.

2. Data Compression

Compression is a technique used to reduce the data size while preserving its essential information, allowing for efficient storage and decompression to reconstruct the original data. Vector quantization (VQ) is a data compression technique that represents a set of data points with a smaller number of representative vectors. It achieves compression by exploiting redundancies or patterns in the data and replacing similar data points with representative vectors.

This package offers several advantages for performing data compression, as it is designed to handle high-dimensional data efficiently. It provides a hierarchical compression approach, allowing multi-resolution representation of the data. The hierarchical structure enables efficient compression and storage of the data while preserving different levels of detail. HVT aims to preserve the topological structure of the data during compression. Spatial data with irregular shapes and complex structures in high-dimensional data can contain valuable information about relationships and patterns. HVT seeks to capture and retain these topological characteristics, enabling meaningful analysis and visualization. This package employs tessellation to divide the compressed data space into distinct cells or regions while preserving the topology of the original data. This means that the relationships and connectivity between data points are maintained in the compressed representation.

This package can perform vector quantization using the following algorithms:

2.1 Hierarchical Vector Quantization

2.1.1 Using k-means

  1. The k-means algorithm randomly selects k data points as initial means.
  2. k clusters are formed by assigning each data point to its closest cluster mean using the Euclidean distance.
  3. Virtual means for each cluster are calculated by using all datapoints contained in a cluster.

The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime of each iteration is linear in the number of data points n, namely O(nkd) for k clusters in d dimensions.
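The three steps above can be sketched in a few lines of base R. This is a minimal illustration only: `simple_kmeans` is a hypothetical helper written for this example, not the routine the package calls, and the two-blob data is synthetic.

```r
# Minimal Lloyd-style k-means; simple_kmeans is a hypothetical helper, not part of HVT.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct rows at random as the initial means
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every mean,
    # then assign each point to the closest mean
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stable: converged
    cluster <- new_cluster
    # Step 3: recompute each mean from its members (an empty cluster keeps its old mean)
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(centers = centers, cluster = cluster)
}

# Synthetic example: two well separated blobs of 50 points each
set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```

In practice the built-in `stats::kmeans` would normally be preferred; the loop is written out here only to make the assign/update cycle visible.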

2.1.2 Using k-medoids

  1. The k-medoids algorithm randomly selects k of the n data points as the initial medoids.
  2. k clusters are formed by assigning each data point to its closest medoid, using any common distance metric.
  3. The medoid of each cluster is updated to the data point in that cluster with the smallest total distance to the other points in the cluster. Unlike a mean, a medoid is always an actual data point.

The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(k * (n-k)^2).
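As a rough illustration of the medoid idea, the sketch below alternates between assigning points and re-picking each medoid. It is a simplified alternating scheme rather than the full PAM swap search that the O(k * (n-k)^2) bound refers to; `simple_kmedoids` is a hypothetical helper, and the data is synthetic and assumed to contain no duplicate points.

```r
# Alternating k-medoids sketch (assign / update); simpler than full PAM.
# simple_kmedoids is a hypothetical helper, not part of HVT.
simple_kmedoids <- function(x, k, max_iter = 100) {
  # Pairwise L1 distances; any common metric could be substituted here
  d <- as.matrix(dist(x, method = "manhattan"))
  assign_cells <- function(medoids) {
    max.col(-d[, medoids, drop = FALSE], ties.method = "first")
  }
  medoids <- sample(nrow(d), k)              # Step 1: k random data points
  for (iter in seq_len(max_iter)) {
    cluster <- assign_cells(medoids)         # Step 2: closest medoid
    # Step 3: the new medoid of a cluster is the member with the smallest
    # total distance to the other members
    new_medoids <- vapply(seq_len(k), function(j) {
      members <- which(cluster == j)
      members[which.min(rowSums(d[members, members, drop = FALSE]))]
    }, integer(1))
    if (all(new_medoids == medoids)) break   # medoids stable: converged
    medoids <- new_medoids
  }
  list(medoid_rows = medoids, cluster = assign_cells(medoids))
}

set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))
fit_med <- simple_kmedoids(pts, k = 2)
pts[fit_med$medoid_rows, ]   # the medoids are rows of the original data
```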

These algorithms divide the dataset recursively into cells using the k-means or k-medoids algorithm. The maximum number of subsets is decided by setting n_cells to, say, five, in order to divide the dataset into a maximum of five subsets. Each of these five subsets is further divided into five subsets (or fewer), resulting in a total of up to twenty-five (5*5) subsets. The recursion terminates when a cell contains fewer than three data points or a stop criterion is reached. In this case, the stop criterion is met when the quantization error of the cell falls below the quantization error threshold.

The steps for this method are as follows:

  1. Select k (number of cells), depth and quantization error threshold.
  2. Perform quantization (using k-means or k-medoids) on the input dataset.
  3. Calculate quantization error for each of the k cells.
  4. Compare the quantization error for each cell to quantization error threshold.
  5. Repeat steps 2 to 4 for each of the k cells whose quantization error is above threshold until stop criterion is reached.

The stop criterion is met for a cell when one of the following conditions holds:

  • its quantization error falls below the quantization error threshold.
  • it contains fewer than three data points.
  • the user-specified depth has been attained.
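The recursion and its stop conditions can be sketched as follows. This is an illustrative outline under stated assumptions, not the package's hvq code: `max_l1_error` and `split_cell` are hypothetical helpers, the quantization error is taken as the maximum L1 distance to the cell centroid, `stats::kmeans` stands in for the quantizer, and the extra guard on the number of distinct points exists only so that `kmeans` can always be called.

```r
# Hypothetical helpers illustrating the recursion; not the HVT implementation.
max_l1_error <- function(cell) {
  centroid <- colMeans(cell)
  max(rowSums(abs(sweep(cell, 2, centroid))))
}

split_cell <- function(x, k, depth, quant_err, level = 0, path = "root") {
  x <- as.matrix(x)
  qe <- max_l1_error(x)
  stop_here <- (level > 0 && qe < quant_err) ||  # error already below threshold
    nrow(x) < 3 ||                               # fewer than three data points
    level >= depth ||                            # user-specified depth attained
    nrow(unique(x)) <= k                         # sketch-only guard for kmeans()
  if (stop_here) {
    return(data.frame(path = path, level = level, n = nrow(x), qe = qe))
  }
  fit <- kmeans(x, centers = k, iter.max = 50)   # step 2: quantize this cell
  pieces <- lapply(seq_len(k), function(j) {     # steps 3-5: recurse per child cell
    split_cell(x[fit$cluster == j, , drop = FALSE], k, depth, quant_err,
               level + 1, paste(path, j, sep = "."))
  })
  do.call(rbind, pieces)
}

set.seed(1)
x <- matrix(runif(400), ncol = 2)                # 200 synthetic points in 2D
leaves <- split_cell(x, k = 3, depth = 3, quant_err = 0.2)
head(leaves)
```

Each row of `leaves` is a terminal cell; the `path` column records the sequence of child indices that leads to it.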

The quantization error for a cell is defined as follows:

\[QE = \max_i(||A-F_i||_{p})\]

where

  • \(A\) is the centroid of the cell
  • \(F_i\) represents a data point in the cell
  • \(m\) is the number of points in the cell
  • \(p\) is the \(p\)-norm metric. Here \(p\) = 1 represents L1 Norm and \(p\) = 2 represents L2 Norm

2.1.3 Quantization Error

Let us try to understand quantization error with an example.

Figure 1: The Voronoi tessellation for level 1 shown for the 5 cells with the points overlayed

An example of a 2 dimensional VQ is shown above.

In the above image, we can see 5 cells with each cell containing a certain number of points. The centroid for each cell is shown in blue. These centroids are also known as codewords since they represent all the points in that cell. The set of all codewords is called a codebook.

Now we want to calculate quantization error for each cell. For the sake of simplicity, let’s consider only one cell having centroid A and m data points \(F_i\) for calculating quantization error.

For each point, we calculate the distance between the point and the centroid.

\[ d = ||A - F_i||_{p} \]

In the above equation, p = 1 means L1_Norm distance whereas p = 2 means L2_Norm distance. In the package, the L1_Norm distance is chosen by default. The user can pass either L1_Norm, L2_Norm or a custom function to calculate the distance between two points in n dimensions.

\[QE = \max_i(||A-F_i||_{p})\]

Now, we take the maximum calculated distance over all m points. This gives us the furthest distance of a point in the cell from the centroid, which we refer to as the Quantization Error. If the Quantization Error is higher than the given threshold, the centroid (codevector) is not a good representation of the points in the cell. We can then perform further Vector Quantization on these points and repeat the above steps.

Please note that the user can select mean, max or any custom function to calculate the Quantization Error. The custom function takes a vector of m values (where each value is the distance between a point and the centroid in n dimensions) and returns a single value, which is the Quantization Error for the cell.

If we select mean as the error metric, the above Quantization Error equation will look like this:

\[QE = \frac{1}{m}\sum_{i=1}^m||A-F_i||_{p}\]
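To make the two formulas concrete, here is a small sketch that computes the quantization error of a single cell for a chosen p-norm and error metric. `cell_quant_error` is a hypothetical helper written for this example and is not the package's internal function; the four-point cell is made up so the arithmetic can be checked by hand.

```r
# cell_quant_error is a hypothetical helper, not the package's internal function.
cell_quant_error <- function(cell, p = 1, error_metric = max) {
  cell <- as.matrix(cell)
  centroid <- colMeans(cell)                # A
  diffs <- abs(sweep(cell, 2, centroid))    # |A - F_i| per coordinate
  dists <- rowSums(diffs^p)^(1 / p)         # ||A - F_i||_p for each of the m points
  error_metric(dists)                       # max, mean, or a custom function
}

# Four corners of a square; the centroid is (1, 1)
cell <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2))

cell_quant_error(cell, p = 1, error_metric = max)    # 2
cell_quant_error(cell, p = 2, error_metric = mean)   # sqrt(2), about 1.414
```

Every corner is at L1 distance 2 and L2 distance sqrt(2) from the centroid, so the max and mean metrics coincide in this symmetric case; they differ as soon as the cell contains points at unequal distances.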

3. Data Projection

Projection mainly involves converting data from its original form to a different space or coordinate system while preserving certain of its properties. By projecting data into a common coordinate system, spatial relationships, distances, areas, and other spatial attributes can be accurately measured and compared.

HVT performs projection as part of its workflow to visualize and explore high-dimensional data. The projection step in HVT involves mapping the compressed data, represented by the hierarchical structure of cells, onto a lower-dimensional space for visualization purposes, as human perception is more suited to interpreting information in lower-dimensional spaces. Users can zoom in/out, rotate, and explore different regions of the projected space to gain insights and understand the data from different perspectives.

Sammon’s projection is an algorithm used in this package to map a high-dimensional space to a space of lower dimensionality while attempting to preserve the structure of inter-point distances in the projection. It is particularly suited for use in exploratory data analysis and is usually considered a non-linear approach since the mapping cannot be represented as a linear combination of the original variables. The centroids are plotted in 2D after performing Sammon’s projection at every level of the tessellation.

Let \(d_{ij}^*\) denote the distance between the \(i^{th}\) and \(j^{th}\) objects in the original space, and \(d_{ij}\) the distance between their projections. Sammon’s mapping aims to minimize the error function below, which is often referred to as Sammon’s stress or Sammon’s error.

\[E=\frac{1}{\sum_{i<j} d_{ij}^*}\sum_{i<j}\frac{(d_{ij}^*-d_{ij})^2}{d_{ij}^*}\]

The minimization of this can be performed either by gradient descent, as proposed initially, or by other means, usually involving iterative methods. The number of iterations needs to be determined experimentally, and convergent solutions are not always guaranteed. Many implementations prefer to use the first principal components as a starting configuration.
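The stress formula can be evaluated directly on a small example. The sketch below runs `MASS::sammon` on synthetic data (60 random points in 5 dimensions, chosen arbitrarily) and computes the stress term by term for both the Sammon solution and the classical scaling configuration that `sammon` uses as its default starting point. `sammon_stress` is a hypothetical helper defined for this illustration.

```r
library(MASS)   # provides sammon()

set.seed(7)
x <- matrix(rnorm(60 * 5), ncol = 5)   # synthetic data: 60 points in 5 dimensions
d_star <- dist(x)                      # d*_ij: distances in the original space

# Project to 2D; the default starting configuration is cmdscale(d_star, k)
fit <- sammon(d_star, k = 2, niter = 100, trace = FALSE)

# Sammon's stress written as in the formula; a dist object holds each pair i < j once
sammon_stress <- function(d_star, d) {
  d_star <- as.vector(d_star)
  d <- as.vector(d)
  sum((d_star - d)^2 / d_star) / sum(d_star)
}

stress_start  <- sammon_stress(d_star, dist(cmdscale(d_star, k = 2)))
stress_sammon <- sammon_stress(d_star, dist(fit$points))
c(classical_scaling = stress_start, sammon = stress_sammon)
```

Comparing the two values shows how far the iterative minimization moved from its starting configuration on this particular dataset.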

4. Tessellation

A Voronoi diagram is a way of dividing space into a number of regions. A set of points (called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other. These regions are called Voronoi cells. The Voronoi diagram is the dual of the Delaunay triangulation, a triangulated mesh created from a set of points in a plane which has the property that no data point lies within the circumcircle of any triangle in the triangulation. Together with the nearest-seed definition of the cells, this guarantees that the resulting cells in the tessellation do not overlap with each other.

By using Delaunay triangulation, HVT can achieve a partitioning of the data space into distinct and non-overlapping regions, which is crucial for accurately representing and analyzing the compressed data. Additionally, the elements of a Delaunay triangulation have well-defined shapes, typically triangles in two dimensions or tetrahedra in three dimensions.
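The defining property of a Voronoi cell, namely that a location belongs to the cell of its nearest seed, can be checked without any tessellation library. The sketch below is only an illustration of that property in base R; the five seeds are made up, and `nearest_seed` is a hypothetical helper, not a function of HVT or deldir.

```r
# Five hypothetical seeds in the unit square
seeds <- rbind(c(0.2, 0.2), c(0.8, 0.2), c(0.5, 0.5), c(0.2, 0.8), c(0.8, 0.8))

# nearest_seed is a hypothetical helper: index of the closest seed (Euclidean)
nearest_seed <- function(points, seeds) {
  d2 <- sapply(seq_len(nrow(seeds)), function(j) {
    rowSums(sweep(points, 2, seeds[j, ])^2)
  })
  max.col(-d2, ties.method = "first")
}

# Label every point of a regular grid with the Voronoi cell it falls in
grid <- as.matrix(expand.grid(x = seq(0, 1, by = 0.05), y = seq(0, 1, by = 0.05)))
cell <- nearest_seed(grid, seeds)
table(cell)   # each grid point receives exactly one label, so cells cannot overlap

# Optional visual check: colour the grid by cell and mark the seeds
# plot(grid, col = cell, pch = 15); points(seeds, pch = 4, cex = 2)
```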

The hierarchical structure resulting from tessellation preserves the inherent structure and relationships within the data. It captures clusters, subclusters, and other patterns in the data, allowing for a more organized and interpretable representation. The hierarchical structure reduces redundancy and enables more compact representations.

Tessellate: Constructing Voronoi Tessellation

In this package, we use the sammon function from the package MASS to project higher dimensional data to a 2D space. The function hvq called from the trainHVT function returns hierarchical quantized data which will be the input for construction of the tessellations. The data is then represented in 2D coordinates and the tessellations are plotted using these coordinates as centroids. We use the package deldir for this purpose. The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, a transformation is performed on the 2D coordinates to get all the points within their parent tile. Tessellations are plotted using these transformed points as centroids. The lines in the tessellations are chopped in places so that they do not protrude outside the parent polygon. This is done for all the subsequent levels.

5. Scoring

Scoring refers to the process of estimating values or outcomes for new data based on existing data patterns. In the training process, a model is developed based on historical data or a training dataset, and this model is then used to score new, unseen data. The model captures the underlying patterns, trends, and relationships present in the training data, allowing it to pinpoint the cell containing similar or related data points.

In this package, we use the scoreHVT function to score each point in the testing dataset.

Scoring Algorithm

The Scoring algorithm recursively calculates the distance between each point in the testing dataset and the cell centroids for each level. The following steps explain the scoring method for a single point in the testing dataset:

  1. Calculate the distance between the point and the centroid of all the cells in the first level.
  2. Find the cell whose centroid has minimum distance to the point.
  3. Check if the cell drills down further to form more cells.
  4. If it doesn’t, return the path. Otherwise, repeat steps 1 to 4 until we reach a level at which the cell doesn’t drill down further.
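The four steps can be sketched as a short recursive function. The nested-list codebook below is a made-up structure for illustration only; it does not mirror the object returned by trainHVT, and `score_point` is a hypothetical helper rather than the scoreHVT implementation. L1 distance is assumed.

```r
# Hypothetical two-level codebook: each node holds its centroids and, per
# centroid, either a child node or NULL when the cell does not drill down.
codebook <- list(
  centroids = rbind(c(0, 0), c(10, 10)),
  children = list(
    list(centroids = rbind(c(-1, 0), c(1, 0)), children = list(NULL, NULL)),
    NULL
  )
)

score_point <- function(point, node, path = integer(0)) {
  # Steps 1-2: L1 distance to every centroid at this level, keep the closest
  d <- rowSums(abs(sweep(node$centroids, 2, point)))
  best <- which.min(d)
  path <- c(path, best)
  # Steps 3-4: descend if the winning cell has children, else return the path
  child <- node$children[[best]]
  if (is.null(child)) path else score_point(point, child, path)
}

score_point(c(0.8, 0.1), codebook)   # 1 2: cell 1 at level 1, then its child 2
score_point(c(9, 9), codebook)       # 2: cell 2 has no children
```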

6. Importing Code Modules

Here is the guide to installing the HVT package. It helps the user install the most recent version of the package.

###direct installation###
#install.packages("HVT")

#or

###git repo installation###
#library(devtools)
#devtools::install_github(repo = "Mu-Sigma/HVT")

NOTE: At the time of writing this vignette, the latest changes were not yet on CRAN, hence we are sourcing the scripts from the R folder directly into the session environment.

# Sourcing required code scripts for HVT
script_dir <- "../R"
r_files <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(r_files, function(file) { source(file, echo = FALSE); }))

7. Example I: HVT with the Torus dataset

In this section, we explore the capacity of the package to visualize multidimensional data by projecting it to two dimensions using Sammon’s projection, and then use the resulting map for scoring.

Data Understanding

First of all, let us see how to generate data for the torus. We are using the geozoo library for this purpose. Geo Zoo (short for Geometric Zoo) is a compilation of geometric objects ranging from three to ten dimensions. Geo Zoo contains regular or well-known objects, e.g. cube and sphere, and some abstract objects, e.g. Boy’s surface, Torus and Hyper-Torus.

Here, we will generate a 3D torus (a torus is a surface of revolution generated by revolving a circle in three-dimensional space one full revolution about an axis that is coplanar with the circle) with 12000 points.
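For readers without geozoo installed, a comparable surface can be sketched from the standard parametric equations of a torus. The radii below (major radius 2, minor radius 1) are assumptions inferred from the coordinate ranges reported in this vignette (x and y within about ±3, z within ±1); they are not taken from the geozoo source, and sampling the two angles uniformly does not give a uniform density over the surface.

```r
set.seed(240)
n <- 12000
R <- 2   # assumed major radius (distance from the axis to the centre of the tube)
r <- 1   # assumed minor radius (radius of the revolved circle)

theta <- runif(n, 0, 2 * pi)   # angle around the axis of revolution
phi   <- runif(n, 0, 2 * pi)   # angle around the tube

torus_sketch <- data.frame(
  x = (R + r * cos(phi)) * cos(theta),
  y = (R + r * cos(phi)) * sin(theta),
  z = r * sin(phi)
)

# Every point satisfies the implicit equation (sqrt(x^2 + y^2) - R)^2 + z^2 = r^2
residual <- with(torus_sketch, (sqrt(x^2 + y^2) - R)^2 + z^2 - r^2)
range(residual)
sapply(torus_sketch, range)
```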

Torus Dataset

The torus dataset includes three columns, x, y and z, which hold the coordinates of each point.

Let’s explore the torus dataset containing 12000 points. For the sake of brevity, we display the first 6 rows.

set.seed(240)
# Here p represents dimension of object, n represents number of points
torus <- geozoo::torus(p = 3,n = 12000) 
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x","y","z")
torus_df <- torus_df %>% round(4)
Table(head(torus_df))
x y z
-2.6282 0.5656 -0.7253
-1.4179 -0.8903 0.9455
-1.0308 1.1066 -0.8731
1.8847 0.1895 0.9944
-1.9506 -2.2507 0.2071
-1.4824 0.9229 0.9672

Let’s visualize the torus (donut) in 3D Space.

plot_ly(x = torus_df$x, y = torus_df$y, z = torus_df$z, type = 'scatter3d',mode = 'markers',
marker = list(color = torus_df$z,colorscale = c('red', 'blue'),showscale = TRUE,size = 3,colorbar = list(title = 'z'))) %>%
layout(scene = list(xaxis = list(title = 'x'),yaxis = list(title = 'y'),zaxis = list(title = 'z'),
aspectratio = list(x = 1, y = 1, z = 0.4)))

Figure 2: 3D Torus

Now let’s have a look at the structure of the torus dataset.

str(torus_df)
#> 'data.frame':    12000 obs. of  3 variables:
#>  $ x: num  -2.63 -1.42 -1.03 1.88 -1.95 ...
#>  $ y: num  0.566 -0.89 1.107 0.19 -2.251 ...
#>  $ z: num  -0.725 0.946 -0.873 0.994 0.207 ...

Data distribution

This section displays four objects.

  1. Variable Histograms: The histogram distribution of all the variables in the dataset.

  2. Box Plots: Box plots for each numeric column in the dataset across panels. These plots will display the median and Inter Quartile Range of each column at a panel level.

  3. Correlation Matrix: This calculates the Pearson correlation, a bivariate correlation value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.

  4. Summary EDA: The table provides descriptive statistics for all the variables in the dataset.

The inbuilt function edaPlots is used to display the four objects mentioned above.

edaPlots(torus_df)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
x -2.9977 -1.149025 -0.00700 -0.0014436 1.5059647 1.140325 2.9995 ▅▇▇▇▅ 12000 0
y -2.9993 -1.113325 0.01305 0.0103464 1.4855540 1.133725 2.9993 ▃▇▇▇▅ 12000 0
z -1.0000 -0.711950 0.01530 0.0044226 0.7117983 0.718550 1.0000 ▇▃▃▃▇ 12000 0

Train - Test Split

Let us split the torus dataset into train and test. We will randomly select 80% of the data as train and remaining as test.

smp_size <- floor(0.80 * nrow(torus_df))
set.seed(279)
train_ind <- sample(seq_len(nrow(torus_df)), size = smp_size)
torus_train <- torus_df[train_ind, ]
torus_test <- torus_df[-train_ind, ]

Training Dataset

Now, let’s have a look at the selected training dataset (containing 9600 data points). For the sake of brevity, we display the first six rows.

rownames(torus_train) <- NULL
Table(head(torus_train))
x y z
1.7958 -0.4204 -0.9878
0.7115 -2.3528 -0.8889
1.9285 1.2034 0.9620
1.0175 0.0344 -0.1894
-0.2736 1.1298 -0.5464
1.8976 2.2391 0.3545

Now let’s have a look at the structure of the training data.

str(torus_train)
#> 'data.frame':    9600 obs. of  3 variables:
#>  $ x: num  1.796 0.712 1.929 1.018 -0.274 ...
#>  $ y: num  -0.4204 -2.3528 1.2034 0.0344 1.1298 ...
#>  $ z: num  -0.988 -0.889 0.962 -0.189 -0.546 ...

Data Distribution

edaPlots(torus_train)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
x -2.9973 -1.151425 -0.01025 -0.0054752 1.5057054 1.125425 2.9995 ▅▇▇▇▅ 9600 0
y -2.9993 -1.107800 0.02090 0.0162564 1.4831739 1.137650 2.9993 ▃▇▇▇▅ 9600 0
z -1.0000 -0.706725 0.01470 0.0046349 0.7099803 0.716775 1.0000 ▇▃▃▃▇ 9600 0

Testing Dataset

Now, let’s have a look at the testing dataset (containing 2400 data points). For the sake of brevity, we display the first six rows.

rownames(torus_test) <- NULL
Table(head(torus_test))
x y z
-2.6282 0.5656 -0.7253
2.7471 -0.9987 -0.3848
-2.4446 -1.6528 0.3097
-2.6487 -0.5745 0.7040
-0.2676 -1.0800 -0.4611
-1.1130 -0.6516 -0.7040

Now let’s have a look at the structure of the test data.

str(torus_test)
#> 'data.frame':    2400 obs. of  3 variables:
#>  $ x: num  -2.628 2.747 -2.445 -2.649 -0.268 ...
#>  $ y: num  0.566 -0.999 -1.653 -0.575 -1.08 ...
#>  $ z: num  -0.725 -0.385 0.31 0.704 -0.461 ...

Data Distribution

edaPlots(torus_test)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
x -2.9977 -1.131025 0.00015 0.0146830 1.5072073 1.193425 2.9908 ▅▇▇▇▅ 2400 0
y -2.9918 -1.131400 -0.00010 -0.0132936 1.4951131 1.111750 2.9861 ▃▇▇▇▅ 2400 0
z -1.0000 -0.733700 0.01570 0.0035733 0.7191727 0.731075 1.0000 ▇▃▃▃▇ 2400 0

7.1 Step 1: Data Compression

Note: The steps of compression, projection, and tessellation are iteratively performed until a minimum compression rate of 80% is achieved. Once the desired compression is attained, the resulting model object is used for scoring using the scoreHVT() function.

The core function for compression in the workflow is HVQ, which is called within the trainHVT function. It has a parameter called quantization error (quant.err). This parameter acts as a threshold and determines the number of levels in the hierarchy. If there are ‘n’ levels in the hierarchy, then all the clusters that were subdivided before reaching this level had a quantization error equal to or greater than the threshold quantization error. The user can define the number of clusters in the first level of the hierarchy, and each cluster in the first level is then sub-divided into the same number of clusters as there are in the first level. This process continues, with each group divided into smaller clusters, until the quantization error of a cell falls below the threshold or another stop condition is reached. The output of this technique is hierarchically arranged vector quantized data.

However, let’s try to comprehend the trainHVT function first before moving on.

trainHVT(
  dataset,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  projection.scale,
  normalize = TRUE,
  seed = 279,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  scale_summary = NA,
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio = 0.8
)

Each of the parameters of trainHVT function have been explained below:

The output of trainHVT function (list of 7 elements) have been explained below with an image attached for clear understanding.

NOTE: Here the attached image is the snapshot of output list generated from iteration 1 which can be referred later in this section

Figure 3: The Output list generated by trainHVT function.

We will use the trainHVT function to compress our data while preserving essential features of the dataset. Our goal is to achieve data compression of at least 80%. In situations where the compression ratio does not meet the desired target, we can explore adjusting the model parameters as a potential solution. This involves modifying parameters such as the quantization error threshold or the number of cells and then rerunning the trainHVT function.

In our example, we will iteratively increase the number of cells until the desired compression percentage is reached, instead of increasing the quantization error threshold, because a higher threshold may reduce the level of detail captured in the data representation.
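The idea of increasing the number of cells until enough of them fall below the threshold can be sketched outside the package. The code below is an assumption-laden stand-in, not trainHVT: it uses `stats::kmeans` on synthetic uniform data, measures each cell by its maximum L1 distance to the centroid, and `pct_cells_below`, the candidate cell counts and the threshold of 0.3 are all chosen for illustration.

```r
# pct_cells_below is a hypothetical helper, not part of HVT.
pct_cells_below <- function(x, n_cells, quant_err) {
  fit <- kmeans(x, centers = n_cells, iter.max = 100)
  # L1 distance from each point to the centroid of its own cell
  l1 <- rowSums(abs(x - fit$centers[fit$cluster, , drop = FALSE]))
  cell_qe <- tapply(l1, fit$cluster, max)   # max error metric, one value per cell
  mean(cell_qe < quant_err)                 # share of cells below the threshold
}

set.seed(240)
x <- matrix(runif(1000 * 3), ncol = 3)      # synthetic stand-in for the training data

for (n_cells in c(10, 50, 250)) {
  pct <- pct_cells_below(x, n_cells, quant_err = 0.3)
  cat(sprintf("n_cells = %3d -> %.2f of cells below threshold\n", n_cells, pct))
  if (pct >= 0.8) break                     # stop once the 80% target is met
}
```

The loop stops at the first candidate that meets the target, which mirrors the manual iterations that follow.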

Iteration 1:

We will pass the below mentioned model parameters along with torus training dataset (containing 9600 datapoints) to trainHVT function.

Model Parameters

set.seed(240)
hvt.torus <- trainHVT(
  torus_train,
  n_cells = 100,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  quant_method = "kmeans"
)

Let’s check out the compression summary.

displayTable(data = hvt.torus[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 100 0 0 n_cells: 100 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, none of the 100 cells has a quantization error below the threshold. We therefore increase the n_cells parameter and check whether the desired compression (80%) is reached.

Let’s take a look at the 1D projection of this iteration. The output of hvq from the trainHVT function is passed to the plotHVT function, which applies Sammon’s projection in 1D using the MASS package. The resulting 1D Sammon’s points are used to determine their corresponding cell IDs and are subsequently plotted in a plotly object.

plotHVT(hvt.torus, plot.type = '1D')

Figure 4: Sammons 1D x Cell ID plot for layer 1 shown for the 100 cells in the torus training dataset

Iteration 2:

Let’s retry by increasing the n_cells parameter to 300 for torus training dataset (containing 9600 datapoints).

Model Parameters

set.seed(240)
hvt.torus2 <- trainHVT(
  torus_train,
  n_cells = 300,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  quant_method = "kmeans"
)

Let’s check out the compression summary again.

displayTable(data = hvt.torus2[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 300 6 0.02 n_cells: 300 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans

It can be observed from the table above that only 6 cells out of 300, i.e. 2% of the cells, fall below the quantization error threshold. We therefore increase the n_cells parameter again and check whether 80% compression is reached.

plotHVT(hvt.torus2, plot.type = '1D')

Figure 5: Sammons 1D x Cell ID plot for layer 1 shown for the 300 cells in the torus training dataset

Iteration 3:

Since we are yet to achieve a compression of at least 80%, let’s try again by increasing the n_cells parameter to 900 for the torus training dataset (containing 9600 datapoints).

Model Parameters

set.seed(240)
hvt.torus3 <- trainHVT(
  torus_train,
  n_cells = 900,
  depth = 1,
  quant.err = 0.1,
  normalize = FALSE,
  distance_metric = "L1_Norm",
  error_metric = "max",
  quant_method = "kmeans"
)

Let’s check the compression summary for torus.

displayTable(data = hvt.torus3[[3]]$compression_summary,columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 900 749 0.83 n_cells: 900 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans

By increasing the number of cells to 900, 749 of the 900 cells (83%) fall below the quantization error threshold. This exceeds the 80% target, so we will not subdivide the cells further.

The next step is to perform data projection on the compressed data. In this step, the compressed data is transformed and projected onto a lower-dimensional space so that it can be visualized and analyzed in a more manageable form.

plotHVT(hvt.torus3, plot.type = '1D')

Figure 6: Sammons 1D x Cell ID plot for layer 1 shown for the 900 cells in the torus training dataset

7.2 Step 2: Data Projection

This section focuses on projecting the compressed data from multiple dimensions to 2D using Sammon’s projection. The following plots show the centroids in 2D space, with the x coordinate of each centroid on the X-axis and the y coordinate on the Y-axis.

Now let’s try to understand plotHVT function. The parameters have been explained in detail below:

plotHVT(hvt.results, line.width, color.vec, pch1 = 21, centroid.size = 1.5,
        title = NULL, maxDepth = NULL, child.level, hmap.cols,
        quant.error.hmap = NULL, n_cells.hmap = NULL,
        label.size = 0.5, sepration_width = 7, layer_opacity = c(0.5, 0.75, 0.99),
        dim_size = 1000, plot.type = '2Dhvt')

Iteration 1:

Let’s see the Sammon’s 2D projection onto a plane with n_cells set to 100 in the first iteration.

plotHVT(hvt.torus, plot.type = '2Dproj')
Figure 7: Sammons 2D Plot for 100 cells

Iteration 2:

Let’s see the Sammon’s 2D projection onto a plane with n_cells set to 300 in the second iteration.

plotHVT(hvt.torus2, plot.type = '2Dproj')
Figure 8: Sammons 2D Plot for 300 cells

Iteration 3:

Let’s see the Sammon’s 2D projection onto a plane with n_cells set to 900 in the third iteration.

plotHVT(hvt.torus3, plot.type = '2Dproj')
Figure 9: Sammons 2D Plot for 900 cells

7.3 Step 3: Tessellation

The deldir package computes the Delaunay triangulation (and hence the Dirichlet or Voronoi tessellation) of a planar point set according to the second (iterative) algorithm of Lee and Schacter. For subsequent levels, a transformation is performed on the 2D coordinates to get all the points within their parent tile. Tessellations are plotted using these transformed points as centroids. plotHVT is the main function for plotting hierarchical Voronoi tessellations.

Iteration 1:

To enhance visualization, let’s generate a plot of the Voronoi tessellation for the first iteration where we set n_cells parameter as 100. This plot will provide a visual representation of the Voronoi regions corresponding to the data points, aiding in the analysis and understanding of the data distribution.

plotHVT(
  hvt.torus,
  line.width = c(0.4),
  color.vec = c("navy blue"),
  centroid.size = 0.6,
  maxDepth = 1, 
  plot.type = '2Dhvt'
)
Figure 10: The Voronoi tessellation for layer 1 shown for the 100 cells in the torus training dataset

Iteration 2:

Now, let’s plot the Voronoi tessellation for the second iteration where we set n_cells parameter to 300.

plotHVT(
  hvt.torus2,
  line.width = c(0.4),
  color.vec = c("navy blue"),
  centroid.size = 0.6,
  maxDepth = 1,
  plot.type = '2Dhvt'
)
Figure 11: The Voronoi tessellation for layer 1 shown for the 300 cells in the torus training dataset

Iteration 3:

Now, let’s plot the Voronoi tessellation again, for the third iteration where we set n_cells parameter to 900.

plotHVT(
  hvt.torus3,
  line.width = c(0.4),
  color.vec = c("navy blue"),
  centroid.size = 0.6,
  maxDepth = 1,
  plot.type = '2Dhvt'
)
Figure 12: The Voronoi tessellation for layer 1 shown for the 900 cells in the torus training dataset

From the plot for 900 cells, the inherent structure of the donut can be easily observed in the two-dimensional space.

7.3.1 Heat Maps

Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus data for better visualization and interpretation of data patterns and distributions.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the torus, allowing us to observe patterns and trends in the distribution of each of the features (x, y, z). The green shades highlight regions with higher coordinate values in each of the heatmaps, while the indigo shades indicate areas with the lowest coordinate values. By analyzing these heatmaps, we can gain insights into the variations and relationships between each of these features within the torus structure.

plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "n",
  line.width = c(0.4),
  color.vec = c("navy blue"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
Figure 13: The Voronoi tessellation for layer 1 and number of cells 900 with the heat map overlaid for `No. of entities in each cell` in the torus dataset

plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "x",
  line.width = c(0.4),
  color.vec = c("navy blue"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
Figure 14: The Voronoi tessellation for layer 1 and number of cells 900 with the heat map overlaid for variable `x` in the torus dataset

plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "y",
  line.width = c(0.4),
  color.vec = c("navy blue"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
Figure 15: The Voronoi tessellation for layer 1 and number of cells 900 with the heat map overlaid for variable `y` in the torus dataset

plotHVT(
  hvt.torus3,
  child.level = 1,
  hmap.cols = "z",
  line.width = c(0.4),
  color.vec = c("navy blue"),
  centroid.size = 0.8,
  plot.type = '2Dheatmap'
)
Figure 16: The Voronoi tessellation for layer 1 and number of cells 900 with the heat map overlaid for variable `z` in the torus dataset
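
To make concrete what a per-cell heatmap value can look like, here is a minimal sketch on simulated data. It is not the package's internal code: the toy points, the use of stats::kmeans to form cells, the choice of the mean as the per-cell summary, and all variable names are assumptions made for illustration only.

```r
# Toy points standing in for the torus data (hypothetical values)
set.seed(1)
pts <- data.frame(x = runif(200, -3, 3),
                  y = runif(200, -3, 3),
                  z = runif(200, -1, 1))

# Form 10 cells with k-means and record each point's cell
km <- kmeans(pts, centers = 10, nstart = 5, iter.max = 50)
pts$cell <- km$cluster

# Analogue of hmap.cols = "n": number of points in each cell
n_per_cell <- table(pts$cell)

# Analogue of hmap.cols = "x", "y", "z": mean of each feature per cell
mean_per_cell <- aggregate(cbind(x, y, z) ~ cell, data = pts, FUN = mean)

print(n_per_cell)
print(mean_per_cell)
```

Each row of mean_per_cell then supplies one shading value per cell and per feature, which is the kind of quantity the heatmaps above visualize.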


7.4 Step 4: Scoring (scoreHVT)

Let’s first look at the signature of the scoreHVT function before moving on.

scoreHVT(data,
         hvt.results.model,
         child.level,
         mad.threshold,
         line.width,
         color.vec,
         normalize,
         seed,
         distance_metric,
         error_metric,
         yVar)

The parameters for the function scoreHVT are explained below:

Now that we have built the model, let us score the testing dataset (2400 data points) to find which cell and which level each point belongs to.

set.seed(240)
scoring_torus <- scoreHVT(
  torus_test,
  hvt.torus3,
  child.level = 1,
  line.width = c(1.2),
  color.vec = c("navy blue"),
  normalize = FALSE
)
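
The idea behind scoring can be sketched in base R as a nearest-centroid lookup. This is only an illustration under stated assumptions: the centroids and points are made-up values, the L1 (Manhattan) distance is chosen for the example, and the helper l1_dist is hypothetical rather than part of the package.

```r
# Hypothetical centroids (one row per cell) and new points to score
centroids <- rbind(c(0, 0, 0),
                   c(2, 0, 1),
                   c(-2, 1, 0))
new_points <- rbind(c(0.1, -0.2, 0.1),
                    c(1.8, 0.3, 0.9),
                    c(-2.2, 0.8, -0.1))

# L1 distance from one point p to every centroid (hypothetical helper)
l1_dist <- function(p, centers) rowSums(abs(sweep(centers, 2, p)))

# Rows: new points, columns: centroids
dist_mat <- t(apply(new_points, 1, l1_dist, centers = centroids))

# Each point is assigned to the cell whose centroid is nearest
assigned_cell <- apply(dist_mat, 1, which.min)
assigned_cell
```

The centroid of the assigned cell then plays the role of the predicted values for that point.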

Let’s see which cell and level each point belongs to and check the mean absolute difference for each of the 2400 records. For the sake of brevity, we will only show the first 100 rows.

Act_pred_Table <- scoring_torus[["actual_predictedTable"]]
rownames(Act_pred_Table) <- NULL
Act_pred_Table %>% head(100) %>% as.data.frame() %>% Table(scroll = TRUE, limit = 100)
Row.No act_x act_y act_z Cell.ID pred_x pred_y pred_z diff
1 -2.6282 0.5656 -0.7253 426 -2.7861000 0.4909889 -0.5494556 0.1361185
2 2.7471 -0.9987 -0.3848 383 2.7102500 -1.1151786 -0.3466500 0.0638262
3 -2.4446 -1.6528 0.3097 43 -2.4912333 -1.5446333 0.3543000 0.0664667
4 -2.6487 -0.5745 0.7040 137 -2.6677286 -0.6784143 0.6529000 0.0580143
5 -0.2676 -1.0800 -0.4611 280 -0.3424625 -1.0900250 -0.5146375 0.0461417
6 -1.1130 -0.6516 -0.7040 302 -1.0766500 -0.7157500 -0.7054333 0.0339778
7 2.0288 1.9519 0.5790 872 2.1083937 1.8521312 0.5822250 0.0608625
8 -2.4799 1.6863 -0.0470 706 -2.4963000 1.6576625 0.0138250 0.0352875
9 -0.4105 -1.1610 -0.6398 254 -0.4811500 -1.1816250 -0.6884375 0.0466375
10 -0.2545 -1.6160 -0.9314 177 -0.3455769 -1.6844538 -0.9581538 0.0620949
11 1.1500 0.3945 -0.6205 551 1.1592500 0.3021167 -0.5940083 0.0427083
12 -1.2557 -1.1369 0.9520 179 -1.3723333 -1.2099583 0.9824583 0.0733833
13 -0.5449 -2.6892 -0.6684 28 -0.6422167 -2.7472583 -0.5593250 0.0881500
14 2.9093 0.7222 -0.0697 800 2.9212455 0.6341636 0.0878182 0.0858333
15 2.3205 1.2520 -0.7711 827 2.3815500 1.2497583 -0.7169583 0.0391444
16 1.4772 -0.5194 -0.9008 461 1.5248286 -0.5221071 -0.9202357 0.0232571
17 -1.3176 -2.6541 0.2690 3 -1.3335700 -2.6427300 0.2473100 0.0163433
18 1.0687 0.1211 -0.3812 513 1.0284222 0.0943667 -0.2486556 0.0665185
19 -0.9632 0.3283 -0.1866 463 -0.9568400 0.3053600 -0.0962600 0.0398800
20 2.5616 0.4634 0.7976 761 2.5270308 0.6503462 0.7860308 0.0776949
21 2.8473 -0.9303 -0.0955 389 2.7885875 -1.0869375 -0.0121750 0.0995583
22 -0.5293 -0.8566 0.1173 320 -0.5603071 -0.8281500 0.0159214 0.0536119
23 -1.9898 -2.1766 0.3150 4 -2.0152273 -2.1853818 0.2187000 0.0435030
24 -0.8845 -1.2219 -0.8709 243 -0.8770917 -1.1194167 -0.8154083 0.0551278
25 0.1553 2.2566 0.9651 791 -0.1405917 2.2460417 0.9644250 0.1023750
26 2.4262 -0.6069 -0.8655 459 2.3732375 -0.7269125 -0.8739250 0.0604667
27 -0.0667 -1.4627 -0.8444 225 -0.0210550 -1.5008750 -0.8642200 0.0345467
28 -0.0655 -1.3311 -0.7448 268 0.0307870 -1.2696174 -0.6801696 0.0741333
29 1.9592 1.5104 0.8806 804 1.9135273 1.3923091 0.9257000 0.0696212
30 1.2332 2.5452 0.5603 865 1.2345700 2.4195300 0.6913900 0.0860433
31 -0.8720 0.4903 0.0287 483 -0.8963687 0.4475438 0.0445500 0.0276583
32 0.2194 -1.7686 0.9760 159 0.1554083 -1.8447417 0.9876583 0.0505972
33 1.5052 0.0445 -0.8694 532 1.4298583 0.0473750 -0.8163250 0.0437639
34 -2.8410 -0.8651 0.2439 103 -2.7982125 -0.8668250 0.3440250 0.0482125
35 1.3203 -2.5967 0.4077 63 1.3162333 -2.6359267 0.3024667 0.0495089
36 -1.5648 1.5577 0.9781 650 -1.5836385 1.4846462 0.9832692 0.0323538
37 0.3589 -1.0419 -0.4400 340 0.3037556 -1.0094222 -0.3189667 0.0695519
38 -0.2900 -2.0106 0.9995 130 -0.1955250 -1.9585250 0.9970500 0.0496667
39 0.5300 1.3668 0.8455 698 0.5965300 1.4019600 0.8775000 0.0445633
40 1.0254 -0.6738 0.6344 409 1.0870214 -0.6666786 0.6896429 0.0413286
41 -0.9306 0.3664 0.0154 483 -0.8963687 0.4475438 0.0445500 0.0481750
42 2.3888 -1.0670 0.7875 411 2.4265714 -0.8913429 0.8056000 0.0771762
43 -0.9830 -0.2043 -0.0897 408 -0.9991286 -0.0896000 -0.0843286 0.0454000
44 0.9499 0.3135 0.0261 541 0.9397000 0.3632111 0.1245333 0.0527815
45 -1.8079 -1.4936 0.9386 127 -1.8869417 -1.2740417 0.9564250 0.1054750
46 1.8399 -1.9295 -0.7459 160 1.8944000 -1.9456429 -0.6913000 0.0417476
47 -0.3304 -1.8481 0.9925 125 -0.4324500 -1.9563625 0.9973125 0.0717083
48 -2.2806 -1.8984 0.2536 15 -2.2888636 -1.9240636 0.1106455 0.0589606
49 -2.3323 1.7320 0.4252 739 -2.1899300 1.9034100 0.4187600 0.1067400
50 0.5520 0.8441 0.1308 593 0.5596300 0.8285200 0.0435700 0.0368133
51 -0.9449 2.2273 0.9078 755 -0.8714400 2.1562000 0.9416200 0.0594600
52 0.2334 -1.4612 -0.8540 214 0.2577091 -1.6095545 -0.9257455 0.0814697
53 2.7387 0.9703 0.4244 817 2.7692917 0.8936417 0.4020917 0.0431861
54 0.3561 1.1619 -0.6199 645 0.2788000 1.1331571 -0.5540571 0.0572952
55 1.7006 1.5569 -0.9522 808 1.5679000 1.6145000 -0.9663727 0.0681576
56 1.7244 -0.5698 0.9829 467 1.7511286 -0.4691643 0.9789000 0.0437881
57 0.9922 1.1438 -0.8741 713 0.9755000 1.2659600 -0.9146000 0.0597867
58 -0.3022 -1.3611 0.7956 227 -0.2324111 -1.3532222 0.7769111 0.0321185
59 -0.9693 1.0602 0.8261 542 -1.0397800 0.9501500 0.8042900 0.0674467
60 1.1313 -0.3595 -0.5824 485 1.1683750 -0.2271833 -0.5861833 0.0577250
61 -0.7561 -2.5384 -0.7611 60 -0.7745733 -2.4460867 -0.8155067 0.0550644
62 2.3168 1.8924 0.1302 892 2.3435429 1.8565571 -0.0876286 0.0934714
63 1.2363 -2.6444 -0.3939 56 1.1031375 -2.7553125 -0.2099688 0.1426688
64 -1.3204 -0.6281 0.8430 260 -1.2649867 -0.7453333 0.8451933 0.0582800
65 1.3733 1.1877 0.9829 716 1.4450500 1.0387850 0.9726800 0.0769617
66 1.0874 -0.1278 0.4251 511 1.1091476 0.0676238 0.4548190 0.0822968
67 2.1300 -1.2171 -0.8914 335 2.0937000 -1.2695750 -0.8886417 0.0305111
68 1.6863 -0.5945 0.9773 467 1.7511286 -0.4691643 0.9789000 0.0639214
69 0.8504 1.0927 -0.7882 681 0.8766125 1.1233500 -0.8183938 0.0290188
70 0.3029 1.0731 0.4656 630 0.3353417 1.0883583 0.5086250 0.0302417
71 -1.4724 1.1331 0.9899 567 -1.5759950 1.1577050 0.9948200 0.0443733
72 -0.5452 -1.2243 0.7514 223 -0.5124667 -1.3020167 0.7994000 0.0528167
73 -1.6866 2.1137 0.7101 763 -1.7228000 2.1506571 0.6477857 0.0451571
74 1.2012 -2.0386 -0.9305 163 1.1917455 -2.0784727 -0.9127727 0.0223515
75 -0.2108 2.3579 0.9301 791 -0.1405917 2.2460417 0.9644250 0.0721306
76 -0.5982 1.3776 -0.8671 656 -0.4744000 1.3518857 -0.8221500 0.0648214
77 -0.2116 -1.0573 -0.3878 303 -0.1932111 -1.0353111 -0.3239222 0.0347519
78 -0.7802 -0.9000 -0.5880 275 -0.7548833 -0.9675500 -0.6347000 0.0465222
79 1.0850 -1.6815 1.0000 182 1.1240000 -1.7907429 0.9880643 0.0533929
80 1.5563 0.1715 -0.9008 617 1.6169563 0.3211500 -0.9319125 0.0804729
81 -0.3790 1.4273 0.8522 652 -0.4341750 1.3844875 0.8355375 0.0382167
82 -1.2769 -0.2633 0.7178 347 -1.3496222 -0.3045778 0.7869778 0.0610593
83 -1.6039 2.4566 0.3575 798 -1.5681929 2.4296786 0.4418857 0.0490048
84 -0.9297 2.4281 -0.8000 797 -0.7923857 2.4132571 -0.8399714 0.0640429
85 0.5324 -0.8526 0.1016 376 0.5584667 -0.8318667 0.0377556 0.0368815
86 0.3928 1.5433 -0.9132 722 0.4812750 1.4948125 -0.9015875 0.0495250
87 1.0031 0.3850 -0.3786 543 1.0104182 0.3301182 -0.3487273 0.0306909
88 -0.7562 0.7889 -0.4207 536 -0.7556625 0.8105500 -0.4519625 0.0178167
89 -1.0870 -0.7523 -0.7350 302 -1.0766500 -0.7157500 -0.7054333 0.0254889
90 -1.8671 -0.8423 -0.9988 199 -1.8501833 -0.8413167 -0.9990500 0.0060500
91 0.8325 -0.9413 0.6689 351 0.9464455 -0.9779455 0.7691091 0.0836000
92 -0.3355 0.9636 0.2005 574 -0.2996538 0.9779923 0.2180000 0.0225795
93 -1.0089 -0.6007 0.5639 296 -1.0457687 -0.6973562 0.6662687 0.0786312
94 1.7725 1.7153 -0.8845 833 1.8060467 1.6629667 -0.8849267 0.0287689
95 0.5539 -0.8888 0.3037 360 0.5277417 -0.8995667 0.2905750 0.0166833
96 0.8149 -2.6016 0.6874 73 0.7755286 -2.6045571 0.6901000 0.0150095
97 0.1104 1.7654 -0.9729 757 0.2032750 1.8113750 -0.9818500 0.0492667
98 1.0107 0.3118 0.3349 537 1.0149000 0.2934500 0.3334625 0.0079958
99 2.2697 -0.3642 0.9543 473 2.1918000 -0.5223000 0.9648643 0.0821881
100 0.4983 -0.8672 -0.0185 376 0.5584667 -0.8318667 0.0377556 0.0505852
hist(Act_pred_Table$diff, breaks = 20, col = "blue", main = "Mean Absolute Difference", xlab = "Difference",xlim = c(0,0.20), ylim = c(0,500))
Figure 17: Mean Absolute Difference
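
As a check on how the diff column can be read, the short example below recomputes it for row 1 of the table above as the mean of the absolute differences between the actual and predicted coordinates. The variable names are chosen for this illustration; the numbers are copied from the table.

```r
# Row 1 of the actual vs. predicted table
actual    <- c(x = -2.6282,    y = 0.5656,    z = -0.7253)
predicted <- c(x = -2.7861000, y = 0.4909889, z = -0.5494556)

# Mean absolute difference across the three features
mad_row1 <- mean(abs(actual - predicted))
round(mad_row1, 7)  # 0.1361185, matching the diff column
```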


8. Example II: HVT with the Personal Computer dataset

Data Understanding

In this section, we will use the Prices of Personal Computers dataset. This dataset contains 6259 observations and 10 features. It records the price, from 1993 to 1995, of 486 personal computers in the US. The variables include price, speed, ram, screen, cd, and others. The dataset can be downloaded from here.

In this example, we will compress this dataset by using hierarchical VQ via k-means and visualize the Voronoi Tessellation plots using Sammons projection. Later on, we will overlay all the variables as a heatmap to generate further insights.

Here, we load the data and store it in the variable computers.

set.seed(240)
computers <- read.csv("https://raw.githubusercontent.com/Mu-Sigma/HVT/master/vignettes/sample_dataset/Computers.csv")

Personal Computers Dataset

The Computers dataset includes the following columns:

Let’s explore the Personal Computers dataset (6259 data points). For the sake of brevity, we display only the first six rows.

computers <- computers[,-1]
Table(head(computers))
price speed hd ram screen cd multi premium ads trend
1499 25 80 4 14 no no yes 94 1
1795 33 85 2 14 no no yes 94 1
1595 25 170 4 15 no no yes 94 1
1849 25 170 8 14 no no no 94 1
3295 33 340 16 14 no no yes 94 1
3695 66 340 16 14 no no yes 94 1

Now let’s have a look at the structure of the dataset.

str(computers)
#> 'data.frame':    6259 obs. of  10 variables:
#>  $ price  : int  1499 1795 1595 1849 3295 3695 1720 1995 2225 2575 ...
#>  $ speed  : int  25 33 25 25 33 66 25 50 50 50 ...
#>  $ hd     : int  80 85 170 170 340 340 170 85 210 210 ...
#>  $ ram    : int  4 2 4 8 16 16 4 2 8 4 ...
#>  $ screen : int  14 14 15 14 14 14 14 14 14 15 ...
#>  $ cd     : chr  "no" "no" "no" "no" ...
#>  $ multi  : chr  "no" "no" "no" "no" ...
#>  $ premium: chr  "yes" "yes" "yes" "no" ...
#>  $ ads    : int  94 94 94 94 94 94 94 94 94 94 ...
#>  $ trend  : int  1 1 1 1 1 1 1 1 1 1 ...

Further processing will be carried out after removing the non-numeric columns from the dataset, since the distribution plots accept only continuous variables and k-means is not suitable for factor variables: their sample space is discrete, so a Euclidean distance function on such a space is not really meaningful. Hence, we will drop the variables cd, multi, premium, and trend from our dataset (the first column, X, was already removed above).

computers <- computers %>% dplyr::select(-c(cd, multi, premium, trend))
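
As an alternative to naming the columns by hand, non-numeric columns can be detected programmatically. The sketch below uses a small made-up data frame (toy) with a few values in the style of the table above. Note that such a filter would keep an integer column like trend, so that column would still need to be dropped explicitly.

```r
# Small hypothetical data frame mixing numeric and character columns
toy <- data.frame(price = c(1499, 1795, 1595),
                  speed = c(25, 33, 25),
                  cd = c("no", "no", "no"),
                  premium = c("yes", "yes", "yes"),
                  stringsAsFactors = FALSE)

# Keep only the columns that are numeric
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
toy_numeric <- toy[, numeric_cols, drop = FALSE]

str(toy_numeric)
```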

Data Distribution

This section displays four objects.

  1. Variable Histograms: The histogram distribution of all the variables in the dataset.

  2. Box Plots: Box plots for each numeric column in the dataset across panels. These plots display the median and interquartile range of each column at a panel level.

  3. Correlation Matrix: This calculates the Pearson correlation, a bivariate value measuring the linear correlation between two numeric columns. The output plot is shown as a matrix.

  4. Summary EDA: The table provides descriptive statistics for all the variables in the dataset.

The package's built-in edaPlots function displays the four objects mentioned above.

edaPlots(computers)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
price 949 1794.0 2144 2219.576610 580.8039557 2595 5399 ▅▇▂▁▁ 6259 0
speed 25 33.0 50 52.011024 21.1577354 66 100 ▇▃▆▁▂ 6259 0
hd 80 214.0 340 416.601694 258.5484452 528 2100 ▇▃▁▁▁ 6259 0
ram 2 4.0 8 8.286947 5.6310989 8 32 ▇▁▂▁▁ 6259 0
screen 14 14.0 14 14.608723 0.9051152 15 17 ▇▅▁▁▁ 6259 0
ads 39 162.5 246 221.301007 74.8352840 275 339 ▂▃▅▇▆ 6259 0
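
For readers who want to see the same kinds of quantities without the package, the following base R sketch computes rough equivalents of the four objects. It is not how edaPlots is implemented; the built-in mtcars data and all object names are assumptions used only for illustration.

```r
# Three numeric columns from a built-in dataset, standing in for `computers`
df <- mtcars[, c("mpg", "hp", "wt")]

# Summary table: descriptive statistics per variable
summary_tbl <- data.frame(
  variable  = names(df),
  min       = sapply(df, min),
  median    = sapply(df, median),
  mean      = sapply(df, mean),
  sd        = sapply(df, sd),
  max       = sapply(df, max),
  n_missing = sapply(df, function(v) sum(is.na(v))),
  row.names = NULL
)

# Correlation matrix (Pearson)
cor_mat <- cor(df, method = "pearson")

# Interquartile range per variable, the spread a box plot displays
iqr_vals <- sapply(df, IQR)

# Histogram bin counts for one variable, computed without drawing
hist_counts <- hist(df$mpg, plot = FALSE)$counts

print(summary_tbl)
print(round(cor_mat, 2))
```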

Train - Test Split

Let us split the computers data into train and test. We will randomly select 80% of the data as train and remaining as test.

num_rows <- nrow(computers)
set.seed(123)
train_indices <- sample(1:num_rows, 0.8 * num_rows)
trainComputers <- computers[train_indices, ]
testComputers <- computers[-train_indices, ]

Training Dataset

Now, let’s have a look at the randomly selected training dataset (5007 data points). For the sake of brevity, we display only the first six rows.

trainComputers_data <- trainComputers %>% as.data.frame() %>% round(4)
trainComputers_data <- trainComputers_data %>% dplyr::select(price,speed,hd,ram,screen,ads)
row.names(trainComputers_data) <- NULL
Table(head(trainComputers_data))
price speed hd ram screen ads
2799 50 230 8 15 216
2197 33 270 4 14 216
2744 50 340 8 17 275
2999 66 245 16 15 139
1974 33 200 4 14 248
2490 33 528 16 14 267

Now let’s have a look at the structure of the training dataset.

str(trainComputers_data)
#> 'data.frame':    5007 obs. of  6 variables:
#>  $ price : num  2799 2197 2744 2999 1974 ...
#>  $ speed : num  50 33 50 66 33 33 66 33 25 50 ...
#>  $ hd    : num  230 270 340 245 200 528 424 212 528 545 ...
#>  $ ram   : num  8 4 8 16 4 16 16 4 16 4 ...
#>  $ screen: num  15 14 17 15 14 14 15 17 14 15 ...
#>  $ ads   : num  216 216 275 139 248 267 259 298 307 158 ...

Data Distribution

edaPlots(trainComputers_data)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
price 949 1794 2144 2217.771120 577.4055716 2594 5399 ▅▇▂▁▁ 5007 0
speed 25 33 50 51.717595 21.0202929 66 100 ▇▃▆▁▂ 5007 0
hd 80 214 340 413.167765 256.7071661 528 2100 ▇▃▁▁▁ 5007 0
ram 2 4 8 8.282804 5.6427034 8 32 ▇▁▂▁▁ 5007 0
screen 14 14 14 14.610146 0.9051065 15 17 ▇▅▁▁▁ 5007 0
ads 39 163 246 222.036948 74.3836093 275 339 ▂▃▅▇▆ 5007 0

Testing Dataset

Now, let’s have a look at the testing dataset (1252 data points). For the sake of brevity, we display only the first six rows.

testComputers_data <- testComputers %>% as.data.frame() %>% round(4)
testComputers_data <- testComputers_data %>% dplyr::select(price,speed,hd,ram,screen,ads)
rownames(testComputers_data) <- NULL
Table(head(testComputers_data))
price speed hd ram screen ads
1595 25 170 4 15 94
1849 25 170 8 14 94
1720 25 170 4 14 94
2575 50 210 4 15 94
2195 33 170 8 15 94
2295 25 245 8 14 94

Now let’s have a look at the structure of the testing dataset.

str(testComputers_data)
#> 'data.frame':    1252 obs. of  6 variables:
#>  $ price : num  1595 1849 1720 2575 2195 ...
#>  $ speed : num  25 25 25 50 33 25 50 33 66 50 ...
#>  $ hd    : num  170 170 170 210 170 245 212 250 130 210 ...
#>  $ ram   : num  4 8 4 4 8 8 8 4 4 4 ...
#>  $ screen: num  15 14 14 15 15 14 14 15 14 17 ...
#>  $ ads   : num  94 94 94 94 94 94 94 94 94 94 ...

Data Distribution

edaPlots(testComputers_data)
variable min 1st Quartile median mean sd 3rd Quartile max hist n_row n_missing
price 999 1788 2168 2226.797125 594.3804481 2599 5399 ▅▇▂▁▁ 1252 0
speed 25 33 50 53.184505 21.6675406 66 100 ▇▃▇▁▂ 1252 0
hd 80 214 420 430.334664 265.4453368 528 2100 ▇▃▁▁▁ 1252 0
ram 2 4 8 8.303514 5.5866594 8 32 ▇▁▂▁▁ 1252 0
screen 14 14 14 14.603035 0.9054893 15 17 ▇▅▁▁▁ 1252 0
ads 39 162 246 218.357828 76.5745442 275 339 ▂▃▅▇▆ 1252 0

Now that we are familiar with the structure of the computers data, we will go through the following steps to get the scores using the Computers dataset.

8.1 Step 1: Data Compression

For more detailed information on Data Compression please refer to section 7.1 of this vignette.

We will use the trainHVT function to compress our data while preserving essential features of the dataset. Our goal is to achieve data compression of at least 80%. If the compression ratio does not meet the desired target, we can adjust the model parameters, such as the quantization error threshold or the number of cells, and then rerun the trainHVT function.

We will pass the model parameters mentioned below, along with the computers training dataset (5007 data points), to the trainHVT function.

Model Parameters

set.seed(240)
hvt.results <- list()
hvt.results <- trainHVT(trainComputers,   
                          n_cells = 440,
                          depth = 1,
                          quant.err = 0.2,
                          normalize = TRUE,
                          distance_metric = "L1_Norm",
                          error_metric = "max",
                          quant_method = "kmeans")
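
To give a feel for what a per-cell quantization error under distance_metric = "L1_Norm" and error_metric = "max" might measure, here is a minimal sketch for a single cell. It is one plausible reading, not the package's implementation: the points are made up, the variable names are hypothetical, and whether the package additionally scales the distance (for example by the number of features) is not shown in this section.

```r
# Hypothetical normalized points that fell into one cell
cell_points <- rbind(c(0.10, -0.20),
                     c(0.30,  0.10),
                     c(-0.10, 0.00))

# The cell centroid is the column-wise mean of its points
centroid <- colMeans(cell_points)

# L1 (Manhattan) distance of each point to the centroid
l1 <- rowSums(abs(sweep(cell_points, 2, centroid)))

# "max" takes the worst-case distance, "mean" the average distance
quant_error_max  <- max(l1)
quant_error_mean <- mean(l1)
c(max = quant_error_max, mean = quant_error_mean)
```

Under this reading, a cell meets the target when its error is below quant.err, and the "max" metric is the stricter of the two because it is never smaller than the mean.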

Now let’s check the compression summary. For each level, the table below shows the number of cells, the number of cells with a quantization error below the threshold, and the percentage of cells with a quantization error below the threshold.

displayTable(data = hvt.results[[3]]$compression_summary, columnName = 'percentOfCellsBelowQuantizationErrorThreshold', value = 0.8, tableType = "compression")
segmentLevel noOfCells noOfCellsBelowQuantizationError percentOfCellsBelowQuantizationErrorThreshold parameters
1 440 355 0.81 n_cells: 440 quant.err: 0.2 distance_metric: L1_Norm error_metric: max quant_method: kmeans

As can be seen from the table above, 81% of the cells have a quantization error below the threshold. Since we have attained the desired compression percentage, we will not subdivide the cells further.
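
The 81% figure is simply the share of cells under the threshold, using the counts from the compression summary table above. The variable names below are chosen for this illustration.

```r
# Counts taken from the compression summary table
n_cells <- 440
n_below <- 355

# Share of cells whose quantization error is below the threshold
pct_below <- round(n_below / n_cells, 2)
pct_below          # 0.81

# Compare against the 80% compression target
pct_below >= 0.8   # TRUE
```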

hvt.results[[3]] gives us detailed information about the hierarchical vector quantized data.

hvt.results[[3]][['summary']] gives a table containing the number of points, the quantization error, and the codebook.

The datatable displayed below is the summary from hvt.results showing Cell.IDs, Centroids and Quantization Error for the 440 cells.

For the sake of brevity, we are displaying only the first 100 rows.

displayTable(data = hvt.results[[3]][['summary']], columnName = 'Quant.Error', value = 0.2, tableType = "summary")
Segment.Level Segment.Parent Segment.Child n Cell.ID Quant.Error price speed hd ram screen ads
1 1 1 7 46 0.08 -0.76 -0.89 -0.88 -0.76 -0.67 1.57
1 1 2 10 108 0.08 -0.80 -0.89 -0.16 -0.76 -0.67 0.67
1 1 3 15 223 0.12 0.37 -0.89 -0.72 -0.05 0.43 -1.65
1 1 4 11 54 0.07 -1.50 -0.89 -0.75 -0.76 -0.67 0.62
1 1 5 8 146 0.13 -0.31 0.68 -0.95 -0.89 -0.67 -0.14
1 1 6 11 150 0.16 -0.66 0.68 -0.78 -0.79 -0.67 -0.73
1 1 7 11 170 0.1 0.03 -1.24 -0.13 -0.05 -0.67 0.38
1 1 8 8 334 0.15 0.62 2.30 0.08 -0.05 0.43 0.04
1 1 9 8 114 0.07 -0.16 0.68 -1.19 -1.11 -0.67 0.87
1 1 10 7 248 0.17 0.51 -0.08 0.34 -0.05 -0.67 -0.30
1 1 11 9 140 0.12 -0.01 0.68 -1.15 -1.00 -0.67 0.33
1 1 12 7 219 0.14 -1.36 0.24 0.46 -0.05 -0.67 -0.74
1 1 13 9 271 0.05 -1.08 0.68 0.49 -0.05 0.43 -0.84
1 1 14 19 109 0.06 -0.31 -0.89 -0.74 -0.76 -0.67 0.38
1 1 15 6 176 0.08 -0.72 -0.89 -0.07 -0.05 0.43 0.72
1 1 16 17 332 0.14 0.42 2.30 0.10 -0.05 0.43 1.50
1 1 17 12 18 0.05 -1.21 -1.27 -1.19 -1.11 -0.67 0.97
1 1 18 19 149 0.16 -0.68 -0.08 -0.46 -0.76 0.43 0.79
1 1 19 17 428 0.35 0.18 2.30 2.53 1.37 0.43 -2.22
1 1 20 20 320 0.36 0.82 -0.16 -0.09 -0.12 2.64 0.71
1 1 21 3 305 0.18 2.27 -0.35 -0.01 -0.05 -0.67 -1.32
1 1 22 7 227 0.1 -0.51 -0.89 0.45 -0.05 0.43 -0.47
1 1 23 10 178 0.12 0.00 0.68 -0.86 -0.76 -0.67 -0.90
1 1 24 9 365 0.1 0.68 -0.08 1.20 1.37 0.43 -0.36
1 1 25 5 14 0.11 -1.99 -0.89 -0.96 -1.11 -0.67 0.18
1 1 26 3 411 0.05 1.25 -0.89 2.29 2.79 0.43 0.57
1 1 27 18 122 0.15 -0.18 -0.98 -0.85 -0.76 0.43 0.68
1 1 28 15 189 0.11 0.40 -0.92 0.03 -0.05 -0.67 0.87
1 1 29 11 107 0.11 -0.49 -0.96 -0.88 -0.76 -0.67 -0.64
1 1 30 7 423 0.47 3.55 0.12 2.51 1.37 -0.67 0.44
1 1 31 14 90 0.05 -0.63 -0.89 -0.79 -0.76 -0.67 0.58
1 1 32 22 430 0.24 0.63 0.75 3.07 2.79 0.43 -2.27
1 1 33 5 390 0.3 1.37 -0.89 3.73 -0.19 -0.45 0.70
1 1 34 25 101 0.18 -0.85 -0.97 -0.71 -0.76 0.43 0.84
1 1 35 11 425 0.07 0.15 2.30 1.70 1.37 0.43 -2.39
1 1 36 10 358 0.05 0.24 -0.89 1.20 1.37 0.43 -0.84
1 1 37 16 166 0.11 0.03 0.68 -1.08 -0.78 -0.67 -1.65
1 1 38 13 45 0.05 -0.91 -0.89 -1.19 -1.11 -0.67 0.42
1 1 39 8 383 0.12 1.15 2.30 0.45 1.37 -0.67 -0.16
1 1 40 5 9 0.07 -1.24 -0.97 -1.19 -1.11 -0.67 1.57
1 1 41 11 419 0.06 1.41 -0.89 2.29 2.79 0.43 -0.82
1 1 42 8 242 0.14 -0.81 -0.08 0.30 -0.05 0.43 -0.65
1 1 43 13 179 0.09 0.41 0.68 -0.76 -0.76 -0.67 0.40
1 1 44 5 375 0.04 0.06 -0.08 1.70 1.37 0.43 -0.79
1 1 45 20 129 0.14 -1.15 0.68 -0.79 -0.76 -0.67 -0.40
1 1 46 10 292 0.22 0.86 0.68 -0.65 -0.12 0.43 -1.41
1 1 47 5 79 0.12 -0.89 -1.04 -0.94 -0.05 -0.67 1.02
1 1 48 23 246 0.11 -0.42 0.68 0.46 -0.05 -0.67 -0.63
1 1 49 8 207 0.25 0.74 -0.89 -0.40 -0.40 0.43 0.52
1 1 50 11 27 0.06 -1.06 -1.27 -1.19 -1.11 -0.67 0.43
1 1 51 8 51 0.11 -1.23 -0.08 -1.05 -0.85 -0.67 0.94
1 1 52 19 288 0.09 0.81 -0.89 0.45 1.37 -0.67 0.88
1 1 53 7 154 0.1 -0.62 0.68 -0.15 -0.76 -0.67 0.84
1 1 54 10 261 0.15 0.61 -0.08 -0.67 -0.05 0.43 -1.40
1 1 55 10 195 0.15 0.18 0.77 -0.09 -0.76 -0.67 0.83
1 1 56 14 250 0.09 0.52 0.68 -0.69 -0.05 -0.67 -1.61
1 1 57 20 331 0.2 -0.63 0.68 0.30 -0.76 2.64 -0.95
1 1 58 9 379 0.15 1.33 -0.08 -0.65 -0.29 2.64 -1.52
1 1 59 14 11 0.21 -0.28 -1.05 -0.79 -0.76 2.64 0.53
1 1 60 29 359 0.13 1.36 0.68 0.18 1.37 0.43 0.77
1 1 61 6 337 0.1 2.46 0.68 0.21 -0.05 -0.67 -0.87
1 1 62 6 1 0.17 -0.17 -1.21 -1.02 -0.76 2.64 1.32
1 1 63 28 243 0.33 -0.33 2.30 -0.23 -0.46 -0.67 -0.88
1 1 64 8 274 0.07 0.41 -1.27 0.45 1.37 -0.67 0.68
1 1 65 13 362 0.14 1.07 0.75 0.35 1.37 0.43 1.34
1 1 66 10 143 0.07 -0.34 -0.89 -0.80 -0.05 -0.67 -1.66
1 1 67 4 265 0.05 -0.55 0.68 1.23 -0.05 -0.67 -0.69
1 1 68 11 13 0.15 -0.83 -0.89 -0.25 -0.76 2.64 -0.33
1 1 69 8 298 0.17 -0.62 0.20 2.29 -0.05 -0.67 -0.95
1 1 70 4 335 0.06 1.34 -0.08 0.45 1.37 -0.67 -0.08
1 1 71 20 204 0.16 0.09 -0.08 0.02 -0.05 -0.67 0.86
1 1 72 10 42 0.06 -1.49 -0.89 -0.75 -0.76 -0.67 1.04
1 1 73 1 429 0 3.08 0.68 0.04 4.20 0.43 0.71
1 1 74 14 186 0.14 -0.79 -0.89 0.45 -0.05 -0.67 -0.68
1 1 75 4 410 0.37 2.27 0.68 3.73 -0.23 -0.40 0.68
1 1 76 9 163 0.16 1.05 -0.89 -0.41 -0.60 -0.67 0.61
1 1 77 10 400 0.07 -0.03 0.68 1.70 1.37 0.43 -2.38
1 1 78 6 275 0.18 1.14 0.68 0.13 -0.05 -0.67 -0.18
1 1 79 25 241 0.16 -0.88 0.68 0.40 -0.05 -0.67 -1.06
1 1 80 6 245 0.14 -1.22 0.68 -0.30 -0.05 0.43 -0.91
1 1 81 21 120 0.16 -0.46 -0.08 -0.82 -0.62 -0.67 0.70
1 1 82 11 40 0.18 -0.93 -0.99 -1.19 -1.11 0.43 0.37
1 1 83 9 342 0.28 1.16 0.43 -0.52 -0.68 2.64 1.12
1 1 84 8 286 0.05 -0.72 1.11 0.50 -0.05 0.43 -0.83
1 1 85 8 33 0.1 -1.06 -0.89 -1.18 -1.02 -0.67 -0.99
1 1 86 5 282 0.23 -1.50 0.53 0.19 -0.48 0.43 -2.16
1 1 87 17 137 0.07 -0.31 0.68 -0.78 -0.76 -0.67 0.96
1 1 88 19 168 0.07 -0.08 -0.89 0.05 -0.05 -0.67 1.04
1 1 89 7 291 0.05 1.10 -0.89 0.15 1.37 -0.67 0.36
1 1 90 6 24 0.19 -0.97 -1.02 -1.07 -0.94 0.43 -1.40
1 1 91 4 434 0.59 4.07 1.49 1.29 0.30 2.64 0.07
1 1 92 19 409 0.65 0.92 0.42 1.41 1.37 2.64 -0.60
1 1 93 9 393 0.46 2.04 2.30 0.91 -0.05 -0.31 -0.34
1 1 94 8 22 0.08 -1.66 -1.27 -0.84 -0.76 -0.67 0.84
1 1 95 23 158 0.22 -1.41 0.68 -0.20 -0.70 -0.67 -1.05
1 1 96 11 330 0.22 1.55 0.68 -0.45 -0.31 0.43 -1.59
1 1 97 6 145 0.17 0.49 -0.89 -0.64 -0.76 -0.67 -0.17
1 1 98 12 121 0.15 0.21 -0.89 -0.80 -0.76 -0.67 -1.70
1 1 99 14 329 0.24 2.06 0.63 0.31 -0.25 0.43 0.99
1 1 100 16 299 0.05 1.21 -0.89 0.46 1.37 -0.67 0.86

Now let us understand what each column in the above summary table means:

All the columns after this will contain centroids for each cell. They can also be called a codebook, which represents a collection of all centroids or codewords.

plotHVT(hvt.results, plot.type = '1D')

Figure 18: Sammons 1D x Cell ID plot for layer 1 shown for the 440 cells in the dataset ’computers’

8.2 Step 2: Data Projection

For more detailed information on Data Projection please refer to section 7.2 of this vignette.

Let’s visualize the Sammons 2D projection onto a plane, with n_cells set to 440.

plotHVT(hvt.results, plot.type = '2Dproj')
Figure 19: Sammons 2D Plot for 440 cells
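
For intuition about the projection step, the sketch below runs Sammon's non-linear mapping on a small random matrix using MASS::sammon. It is an illustration only and does not claim to reproduce what plotHVT does internally; the random centroids matrix and the object names are assumptions.

```r
library(MASS)

# Hypothetical codebook: 20 centroids in 6 dimensions
set.seed(240)
centroids <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)

# Pairwise distances between centroids in the original space
d <- dist(centroids)

# Project to 2D while trying to preserve those pairwise distances
proj <- sammon(d, k = 2, trace = FALSE)

head(proj$points)  # 2D coordinates, one row per centroid
proj$stress        # final Sammon stress (lower means less distortion)
```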


8.3 Step 3: Tessellation

For more detailed information on Voronoi tessellation please refer to section 7.3 of this vignette.

For better visualisation, let’s plot the Voronoi tessellation using the plotHVT function.

plotHVT(hvt.results,
        line.width = c(0.2), 
        color.vec = c("navy blue"),
        centroid.size = 0.01,  
        maxDepth = 1,
        plot.type = '2Dhvt')
Figure 20: The Voronoi Tessellation for layer 1 shown for the 440 cells in the dataset ’computers’
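
The defining property of a Voronoi cell is that every location inside it is closer to that cell's centroid than to any other. The base R sketch below illustrates this on a grid with made-up sites and Euclidean distance. It is a rasterized approximation for intuition, not the geometric construction used to draw the figure above, and all names are hypothetical.

```r
# Five hypothetical 2D sites (projected centroids) in the unit square
set.seed(1)
sites <- cbind(x = runif(5), y = runif(5))

# A regular grid of locations covering the unit square
grid <- expand.grid(x = seq(0, 1, length.out = 50),
                    y = seq(0, 1, length.out = 50))

# Squared Euclidean distance from every grid location to every site
sq_dist <- sapply(seq_len(nrow(sites)), function(i) {
  (grid$x - sites[i, "x"])^2 + (grid$y - sites[i, "y"])^2
})

# Label each location with its nearest site: this is its Voronoi cell
grid$cell <- max.col(-sq_dist, ties.method = "first")

table(grid$cell)
```

Colouring the grid locations by grid$cell would reveal the polygonal regions that the tessellation plot draws exactly.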


8.3.1 Heat Maps

Now let’s plot the Voronoi tessellation with a heatmap overlaid for each feature in the computers dataset for better visualization.

The heatmaps displayed below provide a visual representation of the spatial characteristics of the computers data, allowing us to observe patterns and trends in the distribution of each feature (price, speed, hd, ram, screen, ads). In each heatmap, green shades highlight regions with higher values, while indigo shades indicate areas with the lowest values. By analyzing these heatmaps, we can gain insights into the variations and relationships between these features within the computers data.

plotHVT(
  hvt.results,
  child.level = 1,
  hmap.cols = "n",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.03,
  plot.type = '2Dheatmap'
)
Figure 21: The Voronoi Tessellation with the heat map overlaid over the `No. of entities in each cell` in the ’computers’ dataset



plotHVT(
  hvt.results,
  child.level = 1,
  hmap.cols = "price",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.03,
  plot.type = '2Dheatmap'
)
Figure 22: The Voronoi Tessellation with the heat map overlaid over the variable `price` in the ’computers’ dataset



plotHVT(
  hvt.results,
  child.level = 1,
  hmap.cols = "hd",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.03,
  plot.type = '2Dheatmap'
)
Figure 23: The Voronoi Tessellation with the heat map overlaid over the variable `hd` in the ’computers’ dataset


plotHVT(
  hvt.results,
  child.level = 1,
  hmap.cols = "ram",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.03,
  plot.type = '2Dheatmap'
)
Figure 24: The Voronoi Tessellation with the heat map overlaid over the variable `ram` in the ’computers’ dataset


plotHVT(
  hvt.results,
  child.level = 1,
  hmap.cols = "screen",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.03,
  plot.type = '2Dheatmap'
)
Figure 25: The Voronoi Tessellation with the heat map overlaid over the variable `screen` in the ’computers’ dataset



plotHVT(
  hvt.results,
  child.level = 1,
  hmap.cols = "ads",
  line.width = c(0.2),
  color.vec = c("navy blue"),
  centroid.size = 0.03,
  plot.type = '2Dheatmap'
)
Figure 26: The Voronoi Tessellation with the heat map overlaid over the variable `ads` in the ’computers’ dataset


8.4 Step 4: Scoring (scoreHVT)

For more detailed information on scoring please refer to section 7.4 of this vignette.

Now that we have built the model, let us score the testing dataset (1252 data points) to find which cell and which level each point belongs to.

scoreHVT(data,
         hvt.results.model,
         child.level,
         mad.threshold,
         line.width,
         color.vec,
         normalize,
         seed,
         distance_metric,
         error_metric,
         yVar)

The parameters for the function scoreHVT are explained below:

set.seed(240)
scoring_comp <- scoreHVT(
  testComputers,
  hvt.results,
  child.level = 1,
  line.width = c(1.2),
  color.vec = c("navy blue"),
  normalize = TRUE
)

When normalize is set to TRUE while using scoreHVT, the function has an inbuilt feature to standardize the testing dataset based on the mean and standard deviation of the training dataset from the trainHVT results.
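
The standardization described above can be checked by hand for the price variable. The sketch below takes the training mean and standard deviation of price from the training summary table earlier in this section and applies them to the first three test prices. The variable names are chosen for this illustration.

```r
# Training-set statistics for price, from the training summary table
train_mean_price <- 2217.771120
train_sd_price   <- 577.4055716

# First three prices of the testing dataset
test_price <- c(1595, 1849, 1720)

# z-score using the training mean and standard deviation
z_price <- (test_price - train_mean_price) / train_sd_price
round(z_price, 4)  # -1.0786 -0.6387 -0.8621
```

These values agree with the act_price entries of the first three rows in the actual vs. predicted table that follows.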

Let’s see which cell and level each point belongs to and check the mean absolute difference of each of the 1252 records. For the sake of brevity, we will only show the first 100 rows.

Act_pred_Table <- scoring_comp[["actual_predictedTable"]]
rownames(Act_pred_Table) <- NULL
Act_pred_Table %>% head(100) %>% as.data.frame() %>% Table(scroll = TRUE, limit = 100)
Row.No act_price act_speed act_hd act_ram act_screen act_ads Cell.ID pred_price pred_speed pred_hd pred_ram pred_screen pred_ads diff
3 -1.0786 -1.2710 -0.9473 -0.7590 0.4307 -1.7213 24 -0.9674386 -1.0173151 -1.0699653 -0.9362186 0.4307274 -1.4008948 0.1641938
4 -0.6387 -1.2710 -0.9473 -0.0501 -0.6741 -1.7213 143 -0.3449415 -0.8904536 -0.7968915 -0.0501185 -0.6741149 -1.6554312 0.1484359
7 -0.8621 -1.2710 -0.9473 -0.7590 -0.6741 -1.7213 29 -0.6009525 -1.2710382 -1.0771071 -0.7589986 -0.6741149 -1.6630494 0.0748766
10 0.6187 -0.0817 -0.7914 -0.7590 0.4307 -1.7213 237 0.4887879 0.1720118 -0.7343040 -0.7589986 0.4307274 -1.5151673 0.1078136
11 -0.0394 -0.8905 -0.9473 -0.0501 0.4307 -1.7213 223 0.3675560 -0.8904536 -0.7187220 -0.0501185 0.4307274 -1.6549831 0.1169905
14 0.1338 -1.2710 -0.6551 -0.0501 -0.6741 -1.7213 162 0.0090558 -1.0075566 -0.6679808 -0.0501185 -0.6741149 -1.5682532 0.0923581
15 0.8334 -0.0817 -0.7836 -0.0501 -0.6741 -1.7213 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.1124753
19 -0.2126 -0.8905 -0.6356 -0.7590 0.4307 -1.7213 165 0.0168493 -0.8904536 -0.6745732 -0.7589986 0.4307274 -1.6944183 0.0492299
22 0.9997 0.6795 -1.1031 -0.7590 -0.6741 -1.7213 214 1.0620418 0.1720118 -0.8985638 -0.6408519 -0.6741149 -1.6630494 0.1584633
24 1.1382 -0.0817 -0.7914 -0.7590 2.6404 -1.7213 379 1.3306764 -0.0817113 -0.6529316 -0.2864119 2.6404120 -1.5241299 0.1667877
28 3.0780 -0.8905 0.1513 -0.0501 -0.6741 -1.7213 367 3.0346588 -0.0817113 0.1512706 -0.0501185 -0.6741149 -1.6507259 0.1537945
29 1.5193 -0.8905 -0.2850 1.3676 -0.6741 -1.7213 316 0.9975291 -1.0173151 -0.2850242 1.3676416 -0.6741149 -1.6059131 0.1273423
33 0.6533 -0.8905 -0.7914 -0.0501 2.6404 -1.7213 343 0.8693078 -0.8904536 -0.6467482 -0.3539243 2.6404120 -1.4466679 0.1565291
39 0.3243 -0.0817 -0.7914 -0.0501 -0.6741 -1.7213 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.0322888
40 0.3589 -0.0817 -0.7914 -0.0501 -0.6741 -1.7213 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.0346920
43 0.4871 -0.8905 -0.7836 -0.0501 -0.6741 -1.7213 187 0.4268397 -0.8904536 -0.7057371 -0.0501185 -0.6741149 -1.6271992 0.0387173
46 0.8265 -0.8905 -0.6551 -0.0501 -0.6741 -1.7213 187 0.4268397 -0.8904536 -0.7057371 -0.0501185 -0.6741149 -1.6271992 0.0907463
48 -0.8119 -1.2710 -1.1420 -0.7590 -0.6741 -1.7213 29 -0.6009525 -1.2710382 -1.0771071 -0.7589986 -0.6741149 -1.6630494 0.0556909
53 1.3461 0.6795 -0.2850 -0.0501 0.4307 -1.7213 330 1.5526376 0.6794579 -0.4507596 -0.3078931 0.4307274 -1.5917564 0.1266172
62 -0.7322 -0.8905 -0.9473 -0.7590 0.4307 -1.7213 24 -0.9674386 -1.0173151 -1.0699653 -0.9362186 0.4307274 -1.4008948 0.1637284
73 -0.9054 -0.8905 -0.9473 -0.7590 -0.6741 -1.7213 43 -0.8159512 -0.8904536 -1.0411033 -0.7912204 -0.6741149 -1.6455317 0.0485504
74 1.4309 -0.0817 -0.6551 -0.0501 -0.6741 -1.7213 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.2031061
80 -1.0197 -1.2710 -1.2979 -0.0501 -0.6741 -1.7213 10 -1.2826994 -1.2710382 -1.0498386 -0.8298866 -0.6741149 -1.6424355 0.2282942
85 0.6533 -1.2710 -0.6551 -0.0501 -0.6741 -1.7213 187 0.4268397 -0.8904536 -0.7057371 -0.0501185 -0.6741149 -1.6271992 0.1252963
86 -0.3789 -0.8905 -1.1420 -0.0501 -0.6741 -1.7213 143 -0.3449415 -0.8904536 -0.7968915 -0.0501185 -0.6741149 -1.6554312 0.0741693
93 0.3502 -0.8905 -0.9473 -0.0501 0.4307 -1.7213 223 0.3675560 -0.8904536 -0.7187220 -0.0501185 0.4307274 -1.6549831 0.0520572
99 0.6533 -1.2710 -0.2850 1.3676 -0.6741 -1.7079 316 0.9975291 -1.0173151 -0.2850242 1.3676416 -0.6741149 -1.6059131 0.1166636
100 -0.9054 -0.8905 -0.9473 -0.7590 -0.6741 -1.7079 43 -0.8159512 -0.8904536 -1.0411033 -0.7912204 -0.6741149 -1.6455317 0.0463170
104 1.3530 0.6795 -0.3240 -0.7590 0.4307 -1.7079 330 1.5526376 0.6794579 -0.4507596 -0.3078931 0.4307274 -1.5917564 0.1489529
106 1.3461 0.6795 -0.6356 -0.0501 2.6404 -1.7079 386 1.3665387 0.6794579 -0.6639491 -0.1790058 2.6404120 -1.3607671 0.0874801
110 -0.1260 0.6795 -0.9473 -0.7590 -0.6741 -1.7079 166 0.0277818 0.6794579 -1.0848169 -0.7811511 -0.6741149 -1.6465247 0.0624803
111 1.3461 0.6795 -0.2850 -0.0501 0.4307 -1.7079 330 1.5526376 0.6794579 -0.4507596 -0.3078931 0.4307274 -1.5917564 0.1243839
115 -1.2448 -1.2710 -0.9473 -0.7590 -0.6741 -1.7079 10 -1.2826994 -1.2710382 -1.0498386 -0.8298866 -0.6741149 -1.6424355 0.0461404
118 0.1857 0.6795 -0.6356 -0.0501 -0.6741 -1.7079 250 0.5156322 0.6794579 -0.6943289 -0.0501185 -0.6741149 -1.6079937 0.0814405
133 0.0039 -0.8905 -0.6356 -0.7590 -0.6741 -1.7079 121 0.2098101 -0.8904536 -0.8031243 -0.7589986 -0.6741149 -1.6966589 0.0641230
157 1.3530 -0.0817 -0.9473 -0.7590 -0.6741 -1.7079 214 1.0620418 0.1720118 -0.8985638 -0.6408519 -0.6741149 -1.6630494 0.1260700
158 -0.0394 -0.8905 -0.9473 -0.0501 0.4307 -1.7079 223 0.3675560 -0.8904536 -0.7187220 -0.0501185 0.4307274 -1.6549831 0.1147572
160 2.0458 0.6795 -0.7135 -0.7590 0.4307 -1.7079 330 1.5526376 0.6794579 -0.4507596 -0.3078931 0.4307274 -1.5917564 0.2205371
168 0.6602 -0.0817 -0.7914 -0.7590 -0.6741 -1.7079 214 1.0620418 0.1720118 -0.8985638 -0.6408519 -0.6741149 -1.6630494 0.1542885
174 0.9114 0.6795 -0.6551 -0.0501 -0.6741 -1.7079 250 0.5156322 0.6794579 -0.6943289 -0.0501185 -0.6741149 -1.6079937 0.0891631
175 2.7316 -0.8905 0.1513 -0.0501 -0.6741 -1.7079 367 3.0346588 -0.0817113 0.1512706 -0.0501185 -0.6741149 -1.6507259 0.1948474
176 0.9131 0.6795 -0.6356 -0.7590 0.4307 -1.7079 292 0.8575062 0.6794579 -0.6465256 -0.1210066 0.4307274 -1.4053761 0.1678510
177 0.3139 -0.0817 -0.7836 -0.7590 -0.6741 -1.7079 126 -0.2117972 -0.0817113 -1.0117870 -0.7744091 -0.6741149 -1.6435656 0.1389423
184 0.7382 -0.0817 -0.6551 -0.0501 -0.6741 -1.7079 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.0854228
188 0.6187 0.6795 -0.6356 -0.0501 -0.6741 -1.7079 250 0.5156322 0.6794579 -0.6943289 -0.0501185 -0.6741149 -1.6079937 0.0436297
189 0.8265 -0.8905 -0.2850 1.3676 -0.6741 -1.7079 316 0.9975291 -1.0173151 -0.2850242 1.3676416 -0.6741149 -1.6059131 0.0666520
190 1.3530 0.6795 -0.6551 1.3676 0.4307 -1.6406 370 1.5474644 0.6794579 -0.3757335 1.3676416 0.4307274 -1.0222272 0.1820525
202 0.9824 -0.8905 -0.6356 -0.0501 2.6404 -1.6406 343 0.8693078 -0.8904536 -0.6467482 -0.3539243 2.6404120 -1.4466679 0.1036759
205 -1.0786 -0.8905 -1.2784 -1.1134 -0.6741 -1.6406 6 -1.2106241 -0.8904536 -1.2808087 -1.1134387 -0.6741149 -1.6372821 0.0229751
208 2.7316 -0.8905 0.1513 -0.0501 -0.6741 -1.6406 367 3.0346588 -0.0817113 0.1512706 -0.0501185 -0.6741149 -1.6507259 0.1870060
210 2.9048 0.6795 0.3383 -0.0501 0.4307 -1.6406 382 3.0339927 0.6794579 0.2297796 -0.0501185 -0.4191513 -1.6840769 0.1885170
221 0.7226 -0.8905 -0.6356 -0.0501 2.6404 -1.6406 343 0.8693078 -0.8904536 -0.6467482 -0.3539243 2.6404120 -1.4466679 0.1092785
225 0.6602 -0.0817 -0.7836 -0.0501 -0.6741 -1.6406 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.0701587
226 0.9131 0.6795 -0.9473 -0.0501 -0.6741 -1.6406 250 0.5156322 0.6794579 -0.6943289 -0.0501185 -0.6741149 -1.6079937 0.1138535
228 1.1729 -0.0817 -0.2850 1.3676 -0.6741 -1.6406 366 1.6169873 0.3532425 -0.2850242 1.3676416 -0.6741149 -1.6521663 0.1484461
233 -0.3789 -0.8905 -0.9473 -0.7590 -0.6741 -1.6406 85 -0.3540405 -0.8904536 -0.9781217 -0.7589986 -0.6741149 -1.6344382 0.0103176
235 -0.0394 -1.2710 -0.6551 -0.0501 -0.6741 -1.6406 162 0.0090558 -1.0075566 -0.6679808 -0.0501185 -0.6741149 -1.5682532 0.0661934
237 -1.0786 -1.2710 -0.9473 -0.7590 0.4307 -1.6406 24 -0.9674386 -1.0173151 -1.0699653 -0.9362186 0.4307274 -1.4008948 0.1507438
245 0.6187 0.6795 -0.6356 -0.0501 -0.6741 -1.6406 250 0.5156322 0.6794579 -0.6943289 -0.0501185 -0.6741149 -1.6079937 0.0324131
247 1.8726 0.6795 -0.6551 1.3676 0.4307 -1.6406 370 1.5474644 0.6794579 -0.3757335 1.3676416 0.4307274 -1.0222272 0.2038310
248 0.8334 0.6795 -0.7798 -0.0501 -0.6741 -1.6406 250 0.5156322 0.6794579 -0.6943289 -0.0501185 -0.6741149 -1.6079937 0.0726535
254 -0.8621 -1.2710 -0.9473 -0.7590 -0.6741 -1.6406 29 -0.6009525 -1.2710382 -1.0771071 -0.7589986 -0.6741149 -1.6630494 0.0689097
282 -0.3858 -0.8905 -0.6356 -0.7590 -0.6741 -1.6406 85 -0.3540405 -0.8904536 -0.9781217 -0.7589986 -0.6741149 -1.6344382 0.0634176
283 0.1407 -0.0817 -0.7836 -0.7590 -0.6741 -1.6406 126 -0.2117972 -0.0817113 -1.0117870 -0.7744091 -0.6741149 -1.6435656 0.0998475
285 0.5667 0.6795 -0.6356 -0.0501 0.4307 -1.6406 292 0.8575062 0.6794579 -0.6465256 -0.1210066 0.4307274 -1.4053761 0.1013220
308 0.1164 -0.0817 -0.6356 -0.0501 -0.6741 -1.5331 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.0630729
310 2.7316 -0.8905 0.1513 -0.0501 -0.6741 -1.5331 367 3.0346588 -0.0817113 0.1512706 -0.0501185 -0.6741149 -1.6507259 0.2049227
312 0.6602 -0.0817 -0.7836 -0.0501 -0.6741 -1.5331 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.0867453
313 1.1729 -0.0817 -0.2850 1.3676 -0.6741 -1.5331 366 1.6169873 0.3532425 -0.2850242 1.3676416 -0.6741149 -1.6521663 0.1663628
320 -0.0394 0.6795 -0.9473 -0.7590 -0.6741 -1.5331 166 0.0277818 0.6794579 -1.0848169 -0.7811511 -0.6741149 -1.6465247 0.0567219
325 -0.9140 -0.8905 -1.2784 -1.1134 -0.6741 -1.5331 6 -1.2106241 -0.8904536 -1.2808087 -1.1134387 -0.6741149 -1.6372821 0.0672191
329 1.5106 0.6795 -0.2850 1.3676 -0.6741 -1.5331 366 1.6169873 0.3532425 -0.2850242 1.3676416 -0.6741149 -1.6521663 0.0919653
330 0.6602 -0.8905 -0.6746 -0.7590 0.4307 -1.5331 165 0.0168493 -0.8904536 -0.6745732 -0.7589986 0.4307274 -1.6944183 0.1341285
332 -0.0394 -0.0817 -1.1031 -0.7590 -0.6741 -1.5331 126 -0.2117972 -0.0817113 -1.0117870 -0.7744091 -0.6741149 -1.6435656 0.0649352
348 0.6533 -0.0817 -0.6356 -0.0501 0.4307 -1.5331 261 0.6126177 -0.0817113 -0.6667822 -0.0501185 0.4307274 -1.3986542 0.0343946
349 0.5148 -0.8905 -0.6356 -0.0501 0.4307 -1.5331 223 0.3675560 -0.8904536 -0.7187220 -0.0501185 0.4307274 -1.6549831 0.0587236
355 -0.2473 0.6795 -0.9473 -0.7590 -0.6741 -1.5331 166 0.0277818 0.6794579 -1.0848169 -0.7811511 -0.6741149 -1.6465247 0.0913719
356 1.0066 -0.0817 -0.6746 -0.7590 0.4307 -1.5331 237 0.4887879 0.1720118 -0.7343040 -0.7589986 0.4307274 -1.5151673 0.1415316
360 0.1338 -0.8905 -0.6551 -0.0501 -0.6741 -1.5331 162 0.0090558 -1.0075566 -0.6679808 -0.0501185 -0.6741149 -1.5682532 0.0483114
380 -1.2604 -1.2710 -1.2784 -1.1134 -0.6741 -1.5331 6 -1.2106241 -0.8904536 -1.2808087 -1.1134387 -0.6741149 -1.6372821 0.0894944
389 -1.4249 -1.2710 -1.2784 -1.1134 -0.6741 -1.5331 6 -1.2106241 -0.8904536 -1.2808087 -1.1134387 -0.6741149 -1.6372821 0.1169111
397 0.4005 -0.8905 -0.7135 -0.7590 -0.6741 -1.5331 121 0.2098101 -0.8904536 -0.8031243 -0.7589986 -0.6741149 -1.6966589 0.0739893
399 -0.5521 -0.8905 -0.7836 -0.7590 -0.6741 -1.1163 104 -0.4639412 -0.9750280 -0.8736759 -0.7589986 -0.6741149 -1.1163339 0.0438022
405 1.1642 -0.8905 0.1513 1.3676 -0.6741 -1.1163 301 0.8225694 -1.0288480 0.1512706 1.3676416 -0.6741149 -0.8450132 0.1252252
406 -0.0481 -0.8905 -0.7759 -0.7590 0.4307 -1.1163 165 0.0168493 -0.8904536 -0.6745732 -0.7589986 0.4307274 -1.6944183 0.1240783
417 -0.3858 -0.8905 -0.6356 -0.0501 -0.6741 -1.1163 142 -0.5366611 -0.8904536 -0.7555993 -0.0501185 -0.6741149 -0.7331850 0.1090092
424 0.4801 -0.0817 -0.6356 -0.0501 0.4307 -1.1163 261 0.6126177 -0.0817113 -0.6667822 -0.0501185 0.4307274 -1.3986542 0.0743519
425 0.3243 -0.0817 -0.6356 -0.0501 -0.6741 -1.1163 215 0.3343904 -0.0817113 -0.6924924 -0.0501185 -0.6741149 -1.6366099 0.0978896
430 0.1338 -1.2710 -0.2850 -0.0501 -0.6741 -1.1163 201 0.3305203 -0.9782808 -0.2940138 -0.0501185 -0.6741149 -0.8484916 0.1277159
437 1.0170 -0.8905 -0.6356 -0.0501 2.6404 -1.1163 343 0.8693078 -0.8904536 -0.6467482 -0.3539243 2.6404120 -1.4466679 0.1321818
446 0.6602 -0.8905 -0.6551 1.3676 0.4307 -1.1163 345 0.8334330 -0.4860824 -0.6550957 1.3676416 0.4307274 -1.5868677 0.1747153
452 0.3069 -0.8905 -0.6356 -0.0501 -0.6741 -1.1163 187 0.4268397 -0.8904536 -0.7057371 -0.0501185 -0.6741149 -1.6271992 0.1168426
454 0.3589 0.6795 -0.6356 -0.0501 0.4307 -1.1163 292 0.8575062 0.6794579 -0.6465256 -0.1210066 0.4307274 -1.4053761 0.1449307
455 0.5148 -0.8905 -0.6356 -0.0501 0.4307 -1.1163 226 0.3234717 -0.9596508 -0.4443852 -0.0501185 0.4307274 -0.8902333 0.1129678
463 2.2120 0.6795 0.3383 -0.0501 -0.6741 -1.1163 337 2.4596730 0.6794579 0.2129491 -0.0501185 -0.6741149 -0.8676232 0.1036294
477 -0.3165 -0.8905 -0.6356 -0.0501 -0.6741 -1.1163 142 -0.5366611 -0.8904536 -0.7555993 -0.0501185 -0.6741149 -0.7331850 0.1205592
488 0.6602 -0.0817 -0.0318 -0.7590 0.4307 -1.1163 237 0.4887879 0.1720118 -0.7343040 -0.7589986 0.4307274 -1.5151673 0.2544207
493 -0.0394 -0.0817 -0.7759 -0.7590 -0.6741 -1.1163 152 -0.0460759 -0.0817113 -0.8264972 -0.7589986 -0.6741149 -0.8676232 0.0509962
495 0.8178 -0.0817 -0.2850 -0.0501 0.4307 -1.1163 261 0.6126177 -0.0817113 -0.6667822 -0.0501185 0.4307274 -1.3986542 0.1448960
497 1.3374 -0.0817 0.1513 1.3676 0.4307 -1.1163 370 1.5474644 0.6794579 -0.3757335 1.3676416 0.4307274 -1.0222272 0.2653996
hist(Act_pred_Table$diff, breaks = 20, col = "blue", main = "Mean Absolute Difference", xlab = "Difference", xlim = c(0, 0.6), ylim = c(0, 250))
Figure 27: Mean Absolute Difference
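
The last column of the table above appears to be the mean absolute difference between each row's actual values and the centroid values of the cell it was assigned to, averaged over the six variables. The following is a minimal sketch of that calculation in base R, not the package's own implementation. The matrices `actual` and `predicted` are illustrative names, and the values are copied from the rows numbered 11 and 14 in the table.

```r
# Actual (scaled) values for the rows numbered 11 and 14 in the table above.
actual <- rbind(
  c(-0.0394, -0.8905, -0.9473, -0.0501,  0.4307, -1.7213),
  c( 0.1338, -1.2710, -0.6551, -0.0501, -0.6741, -1.7213)
)

# Centroid values of the cells those rows were assigned to (cells 223 and 162).
predicted <- rbind(
  c(0.3675560, -0.8904536, -0.7187220, -0.0501185,  0.4307274, -1.6549831),
  c(0.0090558, -1.0075566, -0.6679808, -0.0501185, -0.6741149, -1.5682532)
)

# Mean absolute difference per row: average of |actual - predicted| over columns.
mad_per_row <- rowMeans(abs(actual - predicted))
print(mad_per_row)
```

Because the actual values in the table are rounded to four decimal places, the result agrees with the tabulated `diff` column only up to rounding.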

9. Executive Summary

10. Applications

  1. Pricing Segmentation - The package can be used to discover groups of similar customers based on their spending patterns and to understand their price sensitivity.

  2. Market Segmentation - The package can help identify both micro and macro segments. Because the method is hierarchical, it can produce both kinds of segmentation in a single pass.

  3. Anomaly Detection - The method can help categorize system behavior over time and find anomalies when the system changes, for example, fraudulent claims in healthcare insurance.

  4. The package can help reveal the underlying structure of the data. For example, a curved surface such as a sphere or a vase can be approximated by many small low-order polygons in the form of a tessellation.

  5. In biology, Voronoi diagrams are used to model a number of different biological structures, including cells and bone microarchitecture.

  6. Using the base idea of Systems Dynamics, these diagrams can also be used to depict customer state changes over a period of time.
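
As one illustration of the anomaly detection use case in item 3, rows that sit unusually far from their assigned cell centroid can be flagged by thresholding the mean absolute difference. The sketch below uses base R only. The `diff` values are copied from the scoring table in the previous section, while the column name `Row.No` and the cutoff of 0.2 are assumptions chosen for illustration, not values prescribed by the package.

```r
# A few scored rows: row number and mean absolute difference from the centroid.
scored <- data.frame(
  Row.No = c(11, 205, 233, 488, 497),
  diff   = c(0.1169905, 0.0229751, 0.0103176, 0.2544207, 0.2653996)
)

# Hypothetical cutoff; in practice it would be tuned to the data at hand.
threshold <- 0.2

# Flag rows whose distance from the assigned centroid exceeds the cutoff.
scored$anomaly <- scored$diff > threshold
flagged <- scored$Row.No[scored$anomaly]
print(scored)
```

A fixed cutoff is the simplest choice. A quantile of the `diff` distribution, read off a histogram such as Figure 27, would be a reasonable alternative.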

11. References

  1. Topology Preserving Maps : https://users.ics.aalto.fi/jhollmen/dippa/node9.html

  2. Vector Quantization : https://en.wikipedia.org/wiki/Vector_quantization

  3. K-means : https://en.wikipedia.org/wiki/K-means_clustering

  4. Sammon’s Projection : https://en.wikipedia.org/wiki/Sammon_mapping

  5. Voronoi Tessellations : https://en.wikipedia.org/wiki/Centroidal_Voronoi_tessellation

  6. Embedding : https://en.wikipedia.org/wiki/Embedding